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Our  Distributed  Conference  System  (DCS)  is  a  real-time  system  that  is  designed  to 
support  distributed  conferences  and  distributed  applications  over  a  Wide  Area  Network 
(WAN).  In  order  to  operate  correctly  and  efficiently,  it  requires  servers  that  can  withstand 
multiple  failures  (are  highly  resilient)  and  can  process  requests  rapidly  (are  responsive). 

The  primary-backup  approach  is  preferable  to  the  state-machine  approach  for  building 
reliable  servers  for  DCS,  but  its  existing  protocols  can  tolerate  only  a  single  server  failure. 
For  security  reasons,  DCS  servers  must  be  started  manually,  so  it  is  easily  possible  for 
more  than  one  server  to  fail  before  an  administrator  responds  and  recovers  a  failed  server. 
We  therefore  developed  a  new  non-blocking  primary- backup  protocol  that  can  operate  in  a 
synchronous  environment  and  withstand  an  arbitrary  number  of  crash  failures. 

We  analyzed  our  ring  protocol  and  showed  that  it  is  valid,  supports  arbitrarily  high 
resilience,  is  non-blocking,  suffers  a  minimal  failover  delay  and  uses  an  optimal  number 
of  servers.  We  also  studied  its  average-case  behavior  and  found  it  to  scale  well  with  the 
resilience. 


We  then  developed  a  queuing  model  for  our  composite  server  built  using  the  ring  protocol 
and  consisting  of  a  primary  and  a  set  of  backups,  and  simplified  it  so  it  can  be  used  in 
analyzing  complex  systems  that  use  such  servers. 

We  found  that  by  employing  clerks,  we  can  hide  the  composite  nature  of  our  fault- 
tolerant  servers,  making  them  applicable  not  only  to  DCS,  but  also  to  many  similar  dis- 
tributed environments. 

We  have  fully  implemented  the  ring  protocol  on  a  network  of  Sun1  workstations,  as  part 
of  multi-threaded  servers  designed  to  run  in  different  Internet  domains.  The  DCS  project 
itself  is  in  progress  and  we  expect  it  to  be  operational  by  1997. 

While  our  protocol  is  unique  in  providing  reliable  service  that  can  withstand  multiple 
crash  failures  in  synchronous  systems,  it  still  needs  to  be  strengthened  in  some  areas  to 
operate  to  its  full  potential. 


1  Sun  is  a  registered  trademark  of  Sun  Microsystems  Inc. 


CHAPTER  1 
INTRODUCTION 

Our  Distributed  Conference  System  (DCS)  project  requires  fault-tolerant  services,  capa- 
ble of  withstanding  multiple  failures,  in  order  to  operate  correctly  and  efficiently.  Existing 
approaches  to  achieve  fault-tolerance  are  unsuitable  to  DCS  either  because  they  are  inap- 
propriate for  its  environment  or  cannot  tolerate  more  than  a  single  failure.  So,  we  developed 
a  new  protocol  and  used  it  to  build  reliable  servers  for  DCS  that  meet  these  requirements. 

1.1     Motivation 

The  DCS  project  has  been  the  principal  drive  behind  this  work.  A  conference  is  a  Com- 
puter Supported  Cooperative  Work  (CSCW)  tool  to  facilitate  collaborative  work  and  allow 
access  to  pooled  resources.  The  function  of  DCS  is  to  support  distributed  conferences,  dis- 
tributed applications  and  diverse  distributed  activities  over  a  Wide  Area  Network  (WAN). 
After  we  studied  its  operating  environment  and  analyzed  its  numerous  requirements,  we 
found  that  DCS  can  function  correctly  and  efficiently  only  if  we  provide  reliable  services. 
More  precisely,  we  needed  resilient  and  responsive  services.  This  work  focuses  on  providing 
these  services. 

1.2     Need  for  Reliable  Service 

The  environment  in  which  DCS  is  to  operate  and  its  required  functions  both  have 
influenced  our  decision  to  look  for  reliable  services.  We  discuss  these,  emphasizing  those 
aspects  that  necessitated  reliability. 


1.2.1  Environment 

DCS  is  meant  to  work  over  a  WAN  that  links  users  located  in  distinct  Local  Area 
Networks  (LANs).  Specifically,  we  expect  it  to  operate  over  the  Internet,  linking  users 
located  in  different  Internet  domains. 

•  Internet  communication  latency  is  very  high.  If  DCS  is  to  operate  efficiently,  it  has  to 
minimize  the  messages  across  the  Internet.  It  is  therefore  important  that  its  services 
are  reliable  so  we  do  not  need  to  resubmit  service  requests  across  the  Internet. 

•  Different  Internet  domains  are  typically  managed  by  different  administrations.  A 
client  can  detect  the  failure  of  a  remote  service,  but  administrators  in  its  domain 
usually  cannot  take  any  corrective  actions  in  the  remote  domain.  Since  the  remote 
management  of  servers  is  generally  not  possible,  it  is  desirable  that  the  servers  are 
highly  reliable  and  capable  of  withstanding  multiple  failures  to  ensure  the  service  is 
rarely  interrupted. 

1.2.2  Functions 

DCS  must  provide  a  variety  of  distributed  services.  Its  primary  function  is  to  support 
conferences  across  a  WAN  and  all  the  associated  activities,  but  DCS  is  unique  among 
conference  systems  in  that  it  also  provides  a  rich  and  powerful  distributed  environment 
that  has  the  potential  to  support  multifarious  distributed  activities.  In  particular,  it  is 
expected  to  run  distributed  applications  such  as  the  distributed  text[37]  and  graphics[39] 
editors  in  use  with  the  earlier  version  of  DCS[38]. 

We  discuss  below  why  reliability  is  important  for  the  DCS  services. 

•  Many  DCS  operations  require  a  causal  multicast  that  enforces  causal  ordering[12].  For 
example,  multicasts  are  convenient  for  disseminating  conference  information  across 


Figure  1.1.    DCS  Operating  Environment 
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participating  Internet  domains.  Causal  multicast  has  been  generally  recognized  as  an 
important  abstraction  for  building  distributed  applications[13,  26,  41]. 

We  developed  a  causal  delivery  mechanism1  based  on  vector  clocks[5][21][35]  (vector- 
ized versions  of  regular  logical  clocks[28]).  Though  functionally  similar  to  the  ISIS 
0i?Cj4STprimitive[ll]  and  many  methods  exist[12][43][45]  to  implement  causal  or- 
dering, our  approach  is  particularly  suited  to  operate  efficiently  in  the  DCS  system. 
However,  since  this  algorithm  is  based  on  causal  delivery,  it  cannot  tolerate  DCS  node 
failures  and  requires  reliable  services  to  function  correctly. 

•  DCS  is  a  real-time  system.  Most  of  its  services  operate  in  real  time  and  each  has  a 
set  of  deadlines.  If  these  services  are  not  reliable,  DCS  cannot  meet  the  deadlines. 
The  following  are  some  of  the  more  important  services  with  deadlines. 

-  Using  the  notification  service[44],  DCS  users  can  request  to  be  notified  when  an 
event  occurs  and  have  the  option  to  specify  how  soon  or  how  often  they  should 
be  notified. 

The  notification  mechanism  is  a  key  DCS  feature  on  which  other  services  depend. 
DCS  must  provide  guarantees  about  delivery  of  notifications  to  conform  to  the 
options  specified  when  the  request  is  made.  For  example,  user  A  may  ask  to 
be  notified  within  10  minutes  after  user  B  joins  a  particular  conference.  This 
service  therefore  has  to  be  made  reliable. 

—  The  decision  support  service  is  a  vital  DCS  service  that  allows  group  decisions 
to  be  made  within  DCS.  Both  DCS  users  and  the  system  use  the  group  decision 
process.    Users  generally  employ  it  for  elections  of  some  kind.    For  example, 


Newman- Wolfe  and  Ramarao  analyze  this[42] 


members  of  a  conference  may  elect  their  chairman  using  the  decision  support 
service.  More  importantly,  the  system  itself  uses  this  service  for  key  decisions 
that  must  be  made  collectively.  For  example,  if  a  user  attempts  to  modify  a 
shared  document,  the  access  control  service  may  check  with  the  concerned  users 
before  allowing  it.  In  each  case,  the  group  decision  process  may  involve  multiple 
domains. 

It  is  easy  to  see  that  failure  of  any  of  the  services  involved  in  making  the  group 
decision  will  directly  affect  the  decision  process.  When  users  are  trying  to  make  a 
decision  synchronously,  delays  in  reaching  a  decision  can  easily  be  a  frustrating 
experience.  More  importantly,  if  the  decision  processes  are  initiated  by  the 
system,  service  failures  will  result  in  incorrect  or  inefficient  operation. 

1.3     Resilient  and  Responsive  Service 

We  have  seen  that  DCS  requires  reliable  services  to  operate  effectively.  However,  we 
must  more  specifically  examine  its  requirements  in  terms  of  resilience  and  responsiveness. 
Loosely  speaking,  resilience  of  a  service  is  the  number  of  failures  it  can  withstand  and 
responsiveness  is  how  rapidly  it  can  provide  the  service. 

If  recovery  from  failures  is  slow,  it  is  possible  for  multiple  failures  to  occur  and  continue 
simultaneously  over  a  period  of  time.  Unless  the  service  is  capable  of  withstanding  all  these 
failures,  it  will  fail  to  provide  service  when  they  occur  together.  That  is,  the  service  resilience 
should  be  commensurate  with  the  recovery  times.  This  vital  requirement  is  examined 
in  more  detail  in  chapter  3.  As  for  responsiveness,  the  DCS  services  should  be  highly 
responsive  due  to  the  real-time  nature  of  DCS.  We  can  therefore  conclude  that  we  require 
services  that  are  highly  resilient  and  responsive. 


Chapter  2  defines  the  notion  of  resilient  and  responsive  service  in  more  formal  terms. 


CHAPTER  2 
RESILIENT  AND  RESPONSIVE  SERVICE 

2.1     Definition 

When  a  client  makes  a  request,  it  is  desirable  for  it  to  receive  a  quick  response.  A 
server  is  responsive  if  it  can  process  requests  and  return  responses  rapidly.  However,  in 
real-life  situations,  failures  are  bound  to  disrupt  its  functioning,  affecting  its  responsiveness 
or  even  interrupting  the  service.  Servers  are  resilient  if  they  can  operate  correctly  in  spite 
of  failures.  Even  a  momentary  break  in  service  may  cause  a  request  to  be  lost,  triggering 
a  client  timeout.  If  the  client  and  server  are  separated  by  a  slow  network,  the  timeout  can 
be  substantial,  effectively  creating  an  amplification  of  the  failure. 

2.2     Models 

Choice  of  a  model  directly  influences  the  nature  and  magnitude  of  the  problem  of 
providing  resilient  and  responsive  service.  It  is  important  to  choose  a  model  general  enough 
to  be  of  use,  but  restrictive  enough  to  make  the  problem  tenable. 

2.2.1     System  Model 

Distributed  systems  can  be  broadly  divided  into  two  categories,  message-passing  and 
shared-memory.  In  message-passing  systems,  processes  communicate  through  messages 
whereas  in  shared-memory  systems,  they  share  objects  of  some  kind.  We  consider  only 
message-passing  systems  in  this  dissertation. 


Message-passing  systems  may  be  further  classified  as  synchronous  or  asynchronous.  A 
system  is  is  synchronous  only  if  we  can  place  bounds  on  delays  in  message  deliveries  and 
relative  speeds  of  processors;  it  is  asynchronous  otherwise.  To  use  the  synchronous  model, 
therefore,  we  need  to  make  assumptions  which  may  not  always  be  realistic,  but  we  can 
readily  use  asynchronous  model  for  any  system. 

Formally,  a  system  is  synchronous  if  it  satisfies  the  following  properties[23]. 

1.  Message  delay  for  sending,  transporting  and  receiving  a  message  over  a  link  has  a 
known  upper  bound,  6. 

2.  Every  process  p  has  a  local  clock  Cp  with  a  known  bounded  drift  rate  p  relative  to 
real  time  t.  That  is,  VJ  >  f,  j^  <  gtlibgtffl  <\  +  p. 

3.  Every  step  executed  by  a  process  has  a  known  upper  bound  on  the  execution  time. 

A  system  is  asynchronous  if  there  is  no  bound  on  message  delay,  clock  drift  or  execution 
time  of  any  step.  We  do  not  consider  intermediate  models[19][20]. 

The  setup  considered  for  this  dissertation  is  an  internetwork  that  has  a  collection  of 


LANs  interconnected  by  a  LAN  or  a  WAN.  Servers  representing  each  Composite  Server  are 


assumed  to  share  a  LAN,  and  the  Composite  Servers  themselves  typically  linked  by  a  WAN. 
It  is  fruitful  to  model  the  Composite  Server  network  and  the  internetwork  differently,  as 
follows. 

Composite  Server 

We  choose  a  synchronous  model  for  the  Composite  Server  for  the  following  reasons. 

1.  Message  delays  on  a  LAN  are  generally  bounded  and  it  is  often  possible  to  determine 
(through  measurements)  a  message  timeout  that  can  be  used  as  the  bound  to  detect 
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failure.   A  Composite  Server  is  not  guaranteed  to  function  properly  if  this  bound  is 
exceeded. 

2.  We  can  achieve  clock  synchronization  using  one  of  many  existing  algorithms[28,  24, 
33,  50],  For  example,  servers  can  synchronize  their  clocks  using  a  time  server  and 
a  bound  can  be  set  on  the  clock  drift  by  controlling  the  rate  at  which  the  servers 
synchronize. 

3.  Each  server  of  a  Composite  Server  is  an  independent  machine  and  we  can  set  a  limit 
on  the  execution  time,  especially  if  we  can  control  the  load  on  the  processor  and  the 
scheduling  of  processes.  That  is,  the  server  processes  can  be  made  to  run  adequately 
fast. 

Message  delays,  clock  drifts  and  execution  times  all  have  known  bounds,  so  we  can 
use  the  synchronous  model  for  the  Composite  Server  network.  This  greatly  simplifies  the 
algorithms  for  the  Composite  Server. 

Internetwork 

We  require  the  asynchronous  model  for  the  internetwork  linking  the  Composite  Servers 
because  of  the  following  reasons. 

1.  It  is  unreasonable  to  set  a  bound  on  the  message  delays  in  a  WAN.  Such  an  assumption 
would  severely  limit  the  usefulness  of  our  system. 

2.  Composite  Servers  can  not  be  expected  to  synchronize  their  clocks  since  they  are 
spread  over  a  WAN. 
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3.  Each  Composite  Server  is  really  a  collection  of  physical  servers.  However,  since  these 
component  servers  reside  on  a  synchronous  system,  it  is  possible  to  set  a  bound  on 
the  execution  time  of  the  Composite  Server  as  a  whole. 

Since  message  delays  and  clock  drifts  cannot  be  bounded,  we  must  use  the  asynchronous 
model. 

2.2.2     Failure  Model 

Schneider[48]  lists  the  following  failure  models  commonly  found  in  distributed  systems 
literature: 

•  Failstop.  A  processor  fails  by  halting.  Once  it  halts,  the  processor  remains  in  that 
state.  The  fact  that  a  processor  has  failed  is  detectable  by  other  processors[46]. 

•  Crash.  A  processor  fails  by  halting.  Once  it  halts,  the  processor  remains  in  that  state. 
The  fact  that  a  processor  has  failed  may  not  be  detectable  by  other  processors[32]. 

•  Crash+Link.  A  processor  fails  by  halting.  Once  it  halts,  the  processor  remains  in 
that  state.  A  link  fails  by  losing  some  messages,  but  does  not  delay,  duplicate,  or 
corrupt  messages[16]. 

•  Receive  Omission.  A  processor  fails  by  receiving  only  a  subset  of  the  messages  that 
have  been  sent  to  it  or  by  halting  and  remaining  halted[40]. 

•  Send  Omission.  A  processor  fails  by  transmitting  only  a  subset  of  the  messages 
that  it  actually  attempts  to  send  or  by  halting  and  remaining  halted[22]. 

•  General  Omission.  A  processor  fails  by  receiving  only  a  subset  of  the  messages  that 
have  been  sent  to  it,  by  transmitting  only  a  subset  of  the  messages  that  it  actually 
attempts  to  send,  and/or  by  halting  and  remaining  halted[40]. 
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•  Byzantine  Failures.  A  processor  fails  by  exhibiting  arbitrary  behavior[34]. 

As  with  system  models,  we  choose  different  failure  models  for  the  Composite  Servers 
and  the  internetwork  linking  them. 

Composite  Server 

We  choose  the  Crash  model  for  the  Composite  Server. 

1.  Each  server  of  the  Composite  Server  is  an  independent  machine  and  is  assumed  to 
exhibit  crash  behavior. 

2.  The  component  servers  are  linked  using  a  connection-oriented  transport  mechanism 
supporting  exactly-once  semantics  (such  as  TCP/IP),  so  we  have  reliable  FIFO  chan- 
nels. 

Internetwork 

General  omission  is  suitable  for  the  internetwork. 

1.  Though  a  Composite  Server  consists  of  multiple  servers  failing  independently,  algo- 
rithms are  designed  such  that  the  cluster  as  a  whole  remains  halted  after  its  failure. 
That  is,  a  Composite  Server  does  not  exhibit  arbitrary  behavior. 

2.  A  connectionless  transport  mechanism  (such  as  UDP/IP)  is  assumed  and  messages 
may  be  dropped  or  delivered  out  of  order.  Exactly-once  semantics  are  implemented 
by  a  separate  algorithm. 

2.3     Resilience  and  Responsiveness 

2.3.1     Metrics 

The  following  metrics  will  be  used  to  evaluate  the  different  approaches  to  achieving 
fault-tolerance. 
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1.  The  resilience  of  a  service  is  /  if  it  can  work  correctly  in  spite  of  failure  of  up  to  / 
servers.  Alternately,  a  service  is  said  to  be  /  fault-tolerant  if  it  can  withstand  failure 
of  up  to  /  servers. 

In  our  DCS  system,  servers  are  manually  started  for  security  reasons.  So,/  needs  to 
be  adequately  large  to  ensure  that  no  more  than  /  servers  fail  before  failed  servers 
are  restarted. 

2.  The  blocking  time^  tj,  is  the  worst-case  lag  between  a  request  and  its  reply  in  a 
failure-free  execution.  This  should  be  small  since  every  request-response  potentially 
suffers  this  delay. 

3.  The  degree  of  replication,  ti„  is  the  number  of  servers  required  to  provide  fault-tolerant 
service. 

4.  The  failover  time,  tj  ,  is  the  worst-case  period  during  which  requests  may  fail  to 
generate  replies. 

5.  The  timeout,  T,  is  what  a  client  uses  to  wait  for  a  reply.  After  T,  it  tries  another 
server.  If  the  network  is  slow,  T  has  to  be  sufficiently  large. 

6.  The  service  time,  T s,  is  the  worst-case  time  a  client  will  need  to  wait  before  it  can 
get  a  response  to  a  request.  This  is  not  the  same  as  tj  because  client  will  need  to 
wait  for  T  between  attempts. 
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2.3.2  Resilience 

A  server  outage  is  said  to  occur  at  time  /  if  a  correct  client  sends  a  request  at  t,  but 
fails  to  receive  a  response  to  that  request.  A  server  is  (k,  D)  -  bofo1  if  all  its  outages  can 
be  grouped  into  at  most  k  intervals,  each  lasting  no  longer  than  D. 

We  need  to  simulate  a  (k,D)  -  bofo  server.  This  forces  the  server  behavior  to  be 
predictable,  simplifying  our  system  design.  It  should  be  able  to  withstand  up  to  /  failures. 

2.3.3  Responsiveness 

We  need  a  server  with  blocking  time  rj  =  0  and  a  reasonably  small  failover  time  77. 
Blocking  time  should  be  very  low  since  every  normal  service  access  potentially  suffers  this 
delay.  Though  failover  time  can  be  higher,  it  is  essential  to  limit  it  so  the  service  can  be 
provided  with  some  performance  guarantee  in  terms  of  its  availability. 

2.4     Server  Requirements 

To  summarize,  we  seek  a  server  that  operates  in  a  synchronous  environment,  experiences 
only  crash  failures,  and  can  withstand  any  specified  number  of  failures.  This  is  our  primary 
goal.  Further,  we  desire  that  it  is  non-blocking,  and  that  it  operates  with  a  minimal  failover 
delay. 

We  will  study  some  of  the  existing  approaches  to  achieving  these  goals,  in  chapter  3. 


'Here,  bofo  stands  for  bounded-outage-finitely-often 


CHAPTER  3 

PREVIOUS  WORK 


3.1     Existing  Solutions 

Services  are  generally  made  reliable  by  replicating  the  service  state  among  multiple 
servers  that  fail  independently.  Correct  servers  can  thereby  continue  providing  service  even 
if  other  servers  fail. 

3.2     State-Machine  Approach 

One  approach  in  use  is  to  replicate  the  service  state  at  all  the  servers,  each  representing 
a  replica.  Clients  then  send  every  request  to  all  the  non-faulty  servers.  This  is  called  the 
state-machine  or  active-replication  method[47].  A  configuration  with  five  replicas  with  a 
client  across  an  internetwork  is  represented  in  figure  3.1. 

This  method  requires  replica  coordination  where  all  the  replicas  receive  and  process 
the  same  sequence  of  requests.  This  is  possible  in  synchronous  systems,  but  is  expensive. 
With  crash  failures,  we  need  /  -f  1  replicas  for  a  resilience  of/,  so  if/  is  large  the  protocol 
messages  can  clog  the  network.  Moreover,  clients  are  generally  separated  from  the  servers 
by  a  WAN,  so  client  multi-casts  to  the  server  replicas  are  not  desirable. 

Formally,  the  state-machine  approach  has  the  following  specifications. 

•  Order 

SMI:  Requests  issued  by  a  single  client  to  a  given  state  machine  sm  are  processed  by 
sm  in  the  order  they  were  issued. 
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Figure  3.1.    The  State-Machine  Approach 
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SM2:  If  request  r  made  to  a  state  machine  sm  by  client  c  could  have  caused  a  request 
r'  to  be  made  by  a  client  c'  to  sm,  then  sm  processes  r  before  r'. 

•  Agreement 

SM3:  All  non-faulty  processors  agree  on  the  same  value. 

SM4:  If  the  client  is  non-faulty,  then  all  non-faulty  processors  use  its  value  as  the  one 
on  which  they  agree. 

3.3     Primary-Backup  Approach 

A  popular  method  that  is  also  widely  used  commercially  is  to  mark  one  of  the  servers  as 
a  primary  and  the  others  as  backups.  Clients  make  requests  only  to  the  primary  and  receive 
responses  from  it.  The  primary  maintains  the  service  state  of  the  backups  by  passing  them 
state  update  messages.  If  the  primary  fails,  one  of  the  backups  takes  over  as  the  primary 
and  the  clients  are  informed  of  the  change.  This  is  widely  known  as  the  primary-backup  or 
primary-copy  method[17j.  Figure  3.2  illustrates  a  setup  with  four  backups. 

The  primary-backup  specifications  are  as  follows. 

PB1:  There  exists  a  local  predicate  P,  on  the  state  of  each  server  s.  At  any  time,  there  is 
at  most  one  server  s  whose  state  satisfies  P,. 

PB2:  Each  client  i  maintains  a  server  identity  D,  such  that  to  make  a  request,  client  i 
sends  a  message  to  Z);. 

PB3:  If  a  client  request  arrives  at  a  server  that  is  not  the  current  primary,  then  that  request 
is  not  enqueued  (and  therefore  is  not  processed). 

PB4:  There  exist  fixed  values  k  and  A  such  that  the  service  behaves  like  a  single  (k,  A)-bofo 
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Figure  3.2.    The  Primary-Backup  Approach 
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This  approach  is  well  suited  for  our  architecture.  Clients  make  a  single  request  and 
the  messages  between  the  primary  and  the  backups  are  few.  If  servers  suffer  from  crash 
failures,  we  can  reduce  tj  to  0  using  n,  =  /  +  1  servers  for  an  /  fault-tolerant  service. 

The  following  are  some  of  the  primary-backup  protocols  in  use  [15]  on  synchronous 
systems.  Table  3.1  compares  the  protocols. 

3.3.1  Alsberg-Day  Protocol 

In  the  Alsberg-Day  Protocol^},  the  client  sends  a  request  to  the  primary  and  blocks. 
The  primary  performs  the  update,  but  passes  it  to  the  backup  and  blocks.  The  backup 
updates  its  state  and  sends  the  response  to  client  and  an  acknowledgment  to  the  primary. 
The  client  and  the  primary  unblock.  Failures  are  detected  with  "Are-You-Alive"  messages 
sent  at  intervals  of  r. 

This  is  a  blocking  protocol  since  every  request  is  passed  to  the  backup  resulting  in  a 
Tb  =  S.  Moreover,  this  is  designed  for  n,  =  2,  so  has  a  resilience  of  only  1.  The  protocol  is 
also  unsuitable  for  using  RPCs  because  the  primary  receives  requests  whereas  the  backup 
sends  the  replies. 

3.3.2  Tandem  Protocol 

In  the  Tandem  Protocol]  the  primary  waits  for  an  acknowledgment  from  the  backup 
before  it  responds  to  the  client.  It  retries  on  failing  to  receive  it. 

This  also  is  a  blocking  protocol,  with  a  rj  =  26  and  is  also  designed  for  n,  =  2. 

3.3.3  HA-NFS  Protocol 

The  HA-NFS  Protocol[VS\  has  a  primary  and  a  backup  that  share  state  through  a  dual- 
ported  disk.  The  "Are-You-Alive"  attempts  are  retried  k  times. 
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Though  this  is  a  non-blocking  protocol,  the  primary  shares  state  with  the  backup 
through  the  disk.  Moreover,  this  is  also  designed  to  use  only  one  backup. 
Table  3.1.    Existing  Primary-Backup  Protocols 


Protocol 

Failures 

nB 

/ 

Tb 

Tf 

Alsberg-Day 

Crash 

2 

1 

6 

t  +  2S 

Tandem 

Crash+Link 

2 

1 

26 

Not  known 

HA-NFS 

Crash+Link 

2 

1 

0 

T  +  4S 

3.4  Need  for  a  New  Protocol 
We  see  that  all  these  existing  primary-backup  protocols  support  only  one  backup.  This 
is  usually  adequate  if  the  Mean  Time  Between  Failures  (MTBF)  is  high  and  failure  recovery 
time  low:  if  the  backup  fails,  it  is  recovered  quickly  so  that  a  live  backup  is  available  when 
the  primary  fails;  if  the  backup  becomes  the  primary,  the  failed  primary  will  soon  be 
recovered  as  the  new  backup  so  that  it  is  again  available  to  take  over  if  necessary.  In  short, 
it  is  extremely  unlikely  for  both  the  primary  and  the  backup  to  be  in  a  failed  state  at  the 
same  time.  However,  a  single  backup  is  insufficient  in  cases  such  as  the  following. 

1.  If  the  recovery  time  is  comparable  to  the  MTBF,  it  is  easily  possible  for  both  the 
servers  to  be  dead  simultaneously,  resulting  in  a  service  failure. 

In  DCS,  each  server  is  trusted  by  other  servers  and  clients.  To  ensure  that  the  servers 
are  authentic,  they  must  be  started  by  a  human  who  goes  through  an  authentication 
procedure.  So,  when  a  server  fails,  this  person  should  be  notified  of  the  failure;  he 
may  however  respond  after  a  considerable  delay  to  begin  the  recovery  process.  All  this 
means  a  worst-case  recovery  time  possibly  in  the  order  of  several  days.  We  studied1 
the  failure  patterns  of  29  machines  in  our  Computer  and  Information  Science  and 


1See  appendix  A  for  details 
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Engineering  (CISE)  department  over  a  period  of  nearly  3  years  and  found  the  MTBF 
to  be  less  than  10  days  for  as  many  as  20  of  the  machines.  One  backup  (/  =  1)  is 
therefore  not  adequate  for  our  purpose,  so  the  existing  protocols  supporting  only  one 
backup  are  unusable. 

2.  There  may  be  occasions  where  a  failed  server  cannot  be  recovered,  such  as  on  a 
space  probe  and  we  simply  have  no  choice  but  to  employ  more  than  one  backup  in 
such  systems.  Though  our  basic  motivation  is  the  DCS  project,  we  recognize  the 
inadequacy  of  current  protocols  in  such  special  situations. 

Apart  from  arbitrary  resilience,  the  existing  primary-backup  protocols  do  not  also  sup- 
port all  the  following  desirable  properties. 

1.  Non-blocking  operation.  We  need  highly  responsive  servers  for  DCS  to  support  its 
real-time  operations. 

2.  Independence  of  service  states.  The  servers  are  physically  separated  and  fail  indepen- 
dently. 

3.  Model  suitable  for  providing  service  through  RPCs.  This  is  not  essential,  but  is 
desirable. 

We  present  our  ring  protocol  in  the  chapter  4  that  can  satisfy  all  these  requirements. 


CHAPTER  4 
RING  PROTOCOL 


4.1     Requirements 

Before  describing  the  ring  protocol,  we  first  restate  the  requirements  of  any  primary- 
backup  approach: 

PB1:  Each  server  s  maintains  a  predicate  Ps  and  Ps  is  true  on  at  most  one  server  at  any 
time. 

PB2:  Each  client  c  maintains  a  server  identity  Sc  and  it  directs  its  requests  only  to  Sc. 

PB3:  No  server  s  for  which  Ps  is  false  processes  requests  coming  from  a  client. 

PB4:  3  constants  k,  D  such  that  the  primary  and  the  backups  simulate  a  single  (k,D)  —  bofo 
server. 

To  understand  the  requirements  more  precisely,  we  specify  our  problem  formally  using 
TLA+,  an  extension  of  TLA,  the  Temporal  Logic  of  Actions[30,  29,  31,  2,  1). 

: module  ServerMod , 


parameters 

S  :   CONSTANT  Id  of  this  server 

(  :  VARIABLE  Real  time 
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cr 

(*) 

:  VARIABLE 

p 

:   CONSTANT 

V 

:  CONSTANT 

V' 

:  VARIABLE 

<t> 

:  BOOLEAN 

p, 

:    BOOLEAN 

Pc 

:   BOOLEAN 

T 

:  CONSTANT 

0 

:   VARIABLE 

Local  clock  value  at  real  time  t 

Maximum  drift  rate  for  local  clock 

Highest  time  for  execution  of  one  step 

Actual  execution  time  for  a  step 

Flag  indicating  if  the  server  is  failed 

Flag  indicating  if  s  is  a  primary 

Flag  telling  whether  to  process  client  requests 

Period  of  I-am-alive  messages 


definitions 

Fail    =     4>  =>  -.Enabled  (na  <  oo) 
PB3    =     Ps&Pc 

assumption 

BndClkDrft     i     -,0  =fr  (V<  >8  :  ^  <  Cf^\-^A9)  <  \  +  p) 
BndExcTime  =     -i^  =►  (r}a  <  n) 

I 

export 

5,  <f>,  Ps,  r,  Ptf  3,  BndClkDrft,  BndExcTime 


-  module  SyncSysMod  - 


parameters 
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6    :  CONSTANT 
£„:    VARIABLE 


Maximum  network  delay 
Actual  network  delay 


h 


import 

ServerMod 


h 


assumption 

BndMsgDly     =     6,<6 

SyncSys  m     A  CJBndMsgDly 

A  nBndClkDrft 
A  nBndExcTime 


h 


export 

6,  SyncSys 


r 


import 

ServerMod 


h 


module  CrashFailMod  - 


definitions 

RmnFld 


^Enabled  (-«£) 


CrashFail  =     URmnFld 
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r 


parameters 

Sc    :   VARIABLE 
St    :   VARIABLE 


h 


•  module  ClientMod  - 


Current  primary 

Server  to  send  requests  to  and  receive  responses  Iron 


definitions 

PB1     =     Sr  =  Sc 

export 
PB2 


L 


r 


import 

ServerMod,  ClientMod 


h 


■  module  PrmyBckpMod 


definitions 

PB\     =     (fls  :  P,)V(3!s  :  Ps) 
PBA     =     SF,(3!s  :  Ps) 
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PrmyBckp  =  A  0PB1 
A  DPB2 
A  DPB3 
A  OPB4 


r 


parameters 

/  :  CONSTANT 
7-j:  VARIABLE 
r/:  VARIABLE 
n„:  VARIABLE 
/,  :  VARIABLE 


module  CmpSrvMod  - 

Resilience 
Blocking  time 
Failover  time 
Degree  of  replication 
Number  of  failed  servers 


import 

ServerMod,  SyncSysMod,  CrashFailMod,  PrmyBckpMod 


h 


definitions 

Vldty         —  PrmyBckp 

NonBlk      =  rh  =  0 

OptFlvr     =  r,  =  /(r  +  M) 

Op(Oeff     =  if  :  n,  =  /  +  I 
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assumption 

Environ     —     A  SyncSys 

A  CrashFail 
ValRes       =     f  e  {x  :  x  £  I  A  x  >  0} 
FailLim     =     V/  :  fa  <  f 

theorem 

CmpSrv         =     A  D  Vldty 
A  UNonBlk 
A  nOptFlvr 
A  OOptDeg 


4.2     Protocol 

Central  to  the  functioning  of  this  protocol  is  a  ring  which  represents  the  order  in  which 
the  role  of  the  primary  is  passed  (see  figure  4.1).  Each  server  has  an  identity  that  enables 
it  to  be  assigned  a  spot  in  this  ring  in  relation  to  other  servers.  A  ring  order  for  a  server  s 
begins  from  s  in  the  ring  and  continues  in  the  ring  direction  up  to  the  server  immediately 
preceding  5.  For  example,  the  ring  order  is  2  —  3  —  4  —  5  —  1  for  server  2.  The  distance  in 
the  ring  from  server  S\  to  $2  is  the  number  of  links  crossed  while  going  from  si  to  S2  in 
ring  order.  So,  distance  from  server  3  to  2  is  4  in  this  example. 

Communication  occurs  only  from  the  primary  to  the  backups  (see  figure  4.2).  The  ring 
structure  is  therefore  unrelated  to  the  actual  communication  links. 
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Figure  4.1.    Ring:  Passage  of  Primacy 
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Figure  4.2.     Ring:  Initial  Communication  Links 


28 

We  now  present  the  protocol.  In  the  following  algorithms,  various  error  conditions  (such 
as  a  backup  sending  a  state-update  message)  are  not  addressed  for  clarity,  but  this  should 
not  affect  our  analysis  or  understanding  of  the  protocol. 

4.2.1     Primary 

The  primaryf)  in  figure  4.3  is  the  main  primary  function  and  maintains  two  basic 
threads  of  activity,  one  to  handle  incoming  client  requests  and  the  other  to  maintain  its 

primary  status. 

•  The  first  thread  takes  the  following  actions  when  a  client  request  arrives. 

1.  It  places  a  copy  of  the  request  in  a  queue  along  with  a  status  flag  indicating  that 
processing  is  not  complete  for  this  request. 

For  the  ring  protocol  to  work,  it  is  essential  that  each  backup  sees  precisely 
the  same  sequence  of  client  requests  as  the  primary.  Since  client  requests  are 
processed  concurrently  by  independent  threads  of  control,  it  is  possible  for  them 
to  complete  out  of  order.  The  queue  maintains  information  about  the  order  and 
status  of  requests  so  that  the  processQ  function  (see  figure  4.5)  can  use  it  to 
dispatch  the  processed  requests  in  the  order  of  their  arrival. 
For  example,  if  request  r„  arrives  before  rt  but  rj  finishes  first,  rj  is  kept 
pending  till  r„  completes  and  then  r,  and  n  are  dispatched  in  that  order. 

2.  It  creates  a  thread  to  process  the  request  asynchronously.  This  makes  the  service 
discipline  processor  sharing  since  all  requests  are  handled  concurrently  by 
threads  with  small  switching  overhead.  We  will  use  this  important  property  in 
our  queuing  model  for  the  server  in  chapter  5. 
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//  queue:  Queue  to  maintain  order  of  client  requests  and  their  status 

//  self:  Identity  of  self 

//  Primary 
primary  () 

//  Thread  to  process  client  requests 
cobegin 
for  ever  do 

request  <—  receive_message() 

queue_insert(gueue,  request=request,  status=INCOMPLETE) 
thread_create(function=process,  args= request) 
endfor 
coend 

//  Thread  to  maintain  primary  status 
cobegin 
for  every  r  time  do 

for  every  server  (other  than  self)  in  reverse  ring  order  do 

send_message(seruer,  type=I_AM-ALIVE) 
endfor 
endfor 
coend 
endprimary 


Figure  4.3.     Primary  Protocol:  Main  Section 
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•  The  second  thread's  only  function  is  to  broadcast  I_AM_ALIVE  messages  periodically 
to  all  the  backups.  This  is  to  prevent  the  backups  from  timing  out  and  becoming  the 
primary.  Figure  4.4  illustrates  these  broadcasts. 

Note  that  the  I_AM_ALIVE  messages  are  sent  in  reverse  ring  order.  This  is  important 
since  the  primary  may  fail  midway  through  the  broadcast  and  sending  them  this  way 
ensures  that  the  nearer  backup  times  out  first. 

Each  client  request  is  processed  asynchronously  by  processQ  (see  figure  4.5)  in  the 
context  of  an  independent  thread. 

Three  basic  activities  take  place  inside  processQ: 

1.  The  request  is  actually  processed  (by  evaluateQ),  resulting  in  a  change  of  server 
state  and  generating  a  response  for  the  client. 

2.  If  the  request  is  not  the  oldest  among  those  being  processed,  no  further  action  is 
taken  because  dispatching  it  immediately  would  change  the  order  of  requests  sent  (as 
STATE-UPDATE  messages)  to  the  backups.  If  it  is  the  oldest,  this  message  along 
with  all  other  completed  but  blocked  messages  are  dispatched  in  their  order  of  arrival. 
The  purpose  of  this  section  is  to  ensure  that  the  arrival  sequence  of  client  requests  at 
the  primary  is  identical  to  that  at  the  backups.  The  primary  and  the  backups  can  be 
assumed  to  be  in  the  same  state  only  if  these  sequences  are  identical. 

3.  Finally,  the  request  is  relayed  as  a  STATE.UPDATE  to  each  of  the  backups  and  the 
response  returned  to  the  client  (see  figure  4.6).  However,  the  following  restrictions 
need  to  be  imposed: 
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//  queue:  Queue  to  maintain  order  of  client  requests  and  their  status 

//  self:  Identity  of  self 

//  update. mutex:  Mutex  lock  to  serialize  updates  and  reply  to  client 

//  Process  client  requests 
process(  re^uesr) 

//  Evaluate  client  request  normally 

response  <—  evaluate(  request) 

//  If  requests  that  arrived  earlier  did  not  complete,  do  nothing 
queue_setitem(?ueue,  request=reguest,  status=COMPLETE) 
if  (request  ^  queue_head(gueue).request)  then 

thread-.exit() 
end  if 

//  Flush  all  the  processed  requests 
while  (queue_head(<?Meue).status  =  COMPLETE)  do 
request  *—  queue^delete(gueue). request 

//  Broadcast  to  backups  and  reply  to  client,  all  in  one  breath 

mutex_lock(  update-inutex) 

for  every  server  (other  than  self)  in  ring  order  do 

send_message( seruer,  type=STATE_UPDATE,  body= request) 
endfor 

send_message(re(/ties(.sender,  body= response) 
mutex_unlock(  update-tnutex) 
endwhile 
endprocess 

Figure  4.5.    Primary  Protocol:  Processing  Requests 
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•  The  broadcast  to  the  backups  must  be  made  before  the  response  to  the  client. 
If  the  response  is  sent  first,  the  primary  may  fail  before  the  STATE.UPDATE 
messages  are  dispatched  and  this  will  leave  the  backups  in  an  inconsistent  state. 

•  The  STATE_UPDATE  messages  should  be  sent  in  ring  order.  It  is  possible  for 
the  broadcast  to  fail  midway  resulting  in  some  of  the  backups  not  receiving  the 
updates.  By  sending  the  messages  in  ring  order,  we  ensure  that  the  next  primary 
(the  backup  that  is  closest  to  the  primary  in  the  ring)  has  the  best  chance  of 
receiving  the  update.  When  it  becomes  the  primary,  it  will  broadcast  this  last 
message  to  all  the  remaining  backups  (see  section  4.2.2). 

•  All  the  update  messages  and  the  response  for  a  request  must  complete  before 
these  messages  can  be  initiated  for  the  next  request. 

Serializing  the  STATE.UPDATE  broadcasts  to  the  backups  provides  a  guarantee 
about  the  transmission  of  updates:  when  a  backup  receives  an  update  message, 
it  can  be  certain  that  the  previous  update  has  been  successfully  dispatched  to 
all  the  backups.  Consequently,  each  backup  will  need  to  cache  only  the  last 
STATE-UPDATE  message  (see  section  4.2.2). 

The  reason  the  client  response  should  also  be  sent  before  starting  on  another 
request  is  that  if  the  other  request  is  also  from  the  same  client,  there  is  a  risk  of 
sending  the  responses  out  of  order. 

4.2.2     Backup 

Figure  4.7  specifies  the  functioning  of  the  backup.  There  are  three  cases: 

1.  Normally,  the  backup  receives  a  STATE.UPDATE  message,  which  is  really  a  relay  of 
the  client  request  to  the  primary. 
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//  primary:  Identity  of  current  primary 

//  self:  Identity  of  self 

//  last -request:  Last  client  request  received 

//  Backup 
backup() 
for  ever  do 

//  Timeout  proportional  to  distance  from  primary 

T  *—  ring_distance(pnman/,  self)  *  (r  +  6) 

message  *—  receive_message(timeout=T) 

switch  message. type 

//  Regular  relay  of  client  request  -  process  normally 
case  STATE_UPDATE: 

//  Make  sure  this  request  has  not  been  received  already 
if  (request  ^  last-request)  then 
last-request  <—  request 

thread_create(function=evaluate,  args= request) 
endif 

//  This  message  is  either  from  current  primary  or  a  new  one 
case  I_AM_ALIVE: 

primary  *—  message,  sender 

II  Timeout  means  primary  has  failed  -  become  primary 
case  TIMEOUT: 
primary  <—  self 

II  First  announce  to  all  other  backups  about  becoming  primary 
for  every  server  (other  than  self)  in  reverse  ring  order  do 

send_message( server,  type=I_AM_ALIVE) 
endfor 

//  Broadcast  last  client  request  since  some  backups  may  not  have  received  it 
for  every  server  (other  than  self)  in  ring  order  do 

send_message( server,  type=STATE_UPDATE,  body -last-request) 
endfor 

//  Run  the  protocol  for  primary 
primaryQ 
endswitch 
endfor 
endbackup 

Figure  4.7.     Backup  Protocol 
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As  explained  in  section  4.2.1,  when  a  backup  becomes  the  primary,  it  broadcasts  the 
last  update  to  all  the  remaining  backups.  So,  it  is  necessary  to  verify  that  the  update 
message  received  is  not  a  retransmission  of  the  last  update.  The  message  is  processed 
only  if  it  is  new. 

The  update  is  processed  asynchronously  (like  the  primary)  by  creating  a  thread.  As 
with  the  primary,  this  results  in  a  processor  sharing  service  discipline  which  is 
useful  in  our  analysis  in  chapter  5.  Note,  however,  that  the  backup  need  not  maintain 
the  order  of  update  arrivals  since  they  are  not  relayed  further. 

2.  Each  backup  regularly  receives  an  I_AM_ALIVE  message  from  the  primary.  If  there 
is  a  change  of  primary,  the  backup  will  modify  its  local  identity  of  the  primary, 
but  otherwise  does  nothing.  The  absence  of  an  I_AM -ALIVE  is,  in  a  sense,  more 
significant  to  a  backup  than  its  presence  since  that  denotes  failure  of  the  primary  and 
requires  action  by  the  backup  (see  next  item). 

3.  When  a  backup  times  out  waiting  for  a  message  from  the  primary,  it  unilaterally  de- 
clares itself  the  primary  and  immediately  announces  this  to  all  the  remaining  backups 
by  broadcasting  an  I_AM_ALIVE  message.  Like  the  I_AM_ALIVE  broadcasts  of  the 
regular  primary,  the  messages  are  sent  in  reverse  ring  order  to  handle  failure  of  the 
broadcast  midway  (see  section  4.2.1). 

Since  the  last  update  broadcast  from  the  primary  may  have  failed  midway,  the  backup 
also  broadcasts  the  last  update  (in  its  cache)  to  the  remaining  backups  (in  ring  order 
for  reasons  given  in  section  4.2.1). 
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Once  these  preliminary  actions  have  been  taken,  the  backup  begins  to  run  the  primary 
protocol  (in  section  4.2.1)  and  the  failover  is  complete.  Figure  4.8  shows  the  messages 
associated  with  a  failover. 

4.3     Operation  Examples 

In  all  our  examples,  we  assume  a  starting  ring  configuration  given  in  figure  4.1,  unless 
stated  otherwise.  The  following  examples  try  to  illustrate  some  of  the  more  important 
activities  of  the  ring  protocol  and  are  not  meant  to  be  comprehensive  as  there  are  numerous 
variations. 

4.3.1  Idle  Behavior 

Consider  figure  4.4.  Every  r  seconds,  server  1  sends  an  LAM_ALIVE  message  to  2,  3, 
4,  and  5  which  wait  for  (r  +  . ),  2(r  +  6),  3(r  +  6),  and  4(r  +  6)  seconds  respectively  for  l's 
message.  Each  of  these  messages  takes  at  most  6  seconds  to  reach  the  backup. 

4.3.2  Failover 

See  figure  4.8.  Primary  server  1  crashes  before  it  completes  its  wait  of  T  seconds  to  send 
the  next  I_AM_ALIVE  broadcast.  Backup  server  2  waits  for  (r  + . )  seconds  after  it  received 
the  last  I_AM_ALIVE  message,  times  out,  and  becomes  the  new  primary.  Immediately,  it 
makes  an  I_A.M_A.LIVE  broadcast  to  the  remaining  backups,  3,  4  and  5. 

Figure  4.9  gives  the  sequence  of  system  transformations  as  the  primary  keeps  failing. 
When  1  fails  in  (a)  2  takes  over  as  the  primary,  when  2  fails  in  (b)  3  takes  over,  and  so 
on.  Note  that  the  situation  of  all  failed  servers  in  (f)  represents  n,  failures  whereas  the 
protocol  is  designed  to  handle  up  to  only  n,  -  1  failures. 
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Figure  4.9.    Transformations  with  Failing  Primary 
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Figure  4.10.    Ring  Transformation  with  Failing  Primary  and  Backups 
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Figure  4.10  illustrates  the  case  in  which  backups  2  and  3  are  already  dead  when  the 
primary  (1)  crashes.  Server  4  times  out  in  3(r  +  6)  seconds  and  becomes  the  primary,  with 
a  single  backup  (5)  remaining. 

4.3.3     Effects  of  Primary  Failure 

Consider  figure  4.6.  Let  us  consider  the  effect  of  the  primary  failing  in  each  of  six  time 
periods,  A  through  F. 

•  Primary  failing  in  A 

The  primary  is  processing  or  has  processed  the  client  request,  but  has  not  initiated 
any  STATE-UPDATE  messages  or  a  response  to  the  client.  After  the  primary  fails, 
no  backup  will  know  of  this  request.  Server  2  will  eventually  time  out  and  become 
the  new  primary.  The  client  will  also  time  out,  connect  to  2  and  resubmit  its  request. 

•  Primary  failing  in  B,  C  or  D 

In  each  of  these  instances,  updates  have  gone  to  only  some  of  the  backups.  Since  the 
first  update  goes  to  2,  it  receives  the  message  in  every  case.  When  this  server  times 
out  and  becomes  the  primary,  it  will  broadcast  the  update  to  the  remaining  backups, 
3,  4  and  5  which  may  have  missed  it  the  first  time  from  the  primary.  In  the  case  of  1 
failing  in  C,  server  3  will  receive  the  updates  from  both  1  and  2,  but  will  discard  the 
one  from  2  since  it  is  a  repeat.  The  client  will  time  out  and  resubmit  the  request  to 
2. 

•  Primary  failing  in  E 

The  updates  have  gone  to  all  the  backups.  Server  2  will  time  out,  become  the  primary 
and  broadcast  the  update  to  3,  4  and  5  (which  will  ignore  it).  Again,  the  client  will 
time  out  and  resubmit  the  request  with  the  new  primary,  2. 
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•  Primary  failing  in  F 

All  the  updates  have  gone  and  the  response  has  been  dispatched  to  the  client.  So, 
consequences  are  similar  to  the  primary  failing  in  E,  except  that  the  client  receives 
its  response,  so  does  not  resubmit  the  request. 

4.4     Validity 

Theorem  1   The  ring  protocol  satisfies  the  primary-backup  specifications. 

Proof:  Let  us  consider  the  primary-backup  requirements,  PB1-PB4  individually: 

•  Initially,  only  one  server  p  is  designated  as  the  primary  and  the  rest  as  backups.  So, 
if  there  is  no  failure,  PB1  is  satisfied  since  p  is  the  only  primary. 

Now  if  p  fails,  at  least  one  of  the  backups  will  time  out  (since  the  timeouts  are  all 
finite)  and  become  the  primary.  Let  this  be  6.  We  prove,  by  contradiction,  that 
another  backup,  6'  cannot  also  have  become  the  primary.  Server  b'  must  lie  either  in 
the  path  from  p  to  6  or  from  6  to  p  in  the  ring. 

-  If  b'  lies  before  6,  according  to  the  protocol,  it  will  time  out  rfj'tfr  +  6)  earlier 
than  b  if  the  ring  distance  from  b'  to  6  is  dj>j.  Immediately,  b'  will  become 
the  primary  and  broadcast  an  LAM-ALIVE  message  to  all  the  other  servers, 
including  6.  Now,  b  has  to  get  the  message  in  r  time  since  that  is  the  worst-case 
message  delay  on  the  network.  Since  b  times  out  dj'j(r  +  #)  later  than  6',  it  will 
receive  the  I_AM_ALIVE  message  from  b'  before  it  can  time  out  and  become  the 
primary.  So,  b  cannot  become  the  primary.  But  this  violates  our  assumption 
that  b  has  become  the  primary,  so  6'  cannot  lie  in  the  path  from  p  to  b. 

-  If  b'  comes  between  6  and  p  in  the  ring,  it  will  time  out  dsi'(r  +  S)  later  than 
6.    Immediately  on  becoming  the  primary,  b  will  broadcast  an  I_AM_ALIVE 
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message  to  all  servers,  including  6'  and  b'  will  receive  it  with  at  most  T  delay. 
Since  b'  times  out  dif(r  +  ^)  later  than  6,  it  cannot  become  the  primary.  But 
we  assumed  that  b'  also  has  become  the  primary,  so  b'  cannot  he  in  the  path 
from  b  to  p  in  the  ring. 

Since  b'  cannot  lie  either  between  p  and  6  or  6  and  p,  it  cannot  exist.  So,  we  can 
have  at  most  one  backup  becoming  the  primary  and  PB1  is  upheld  with  the  failure 
of  the  primary. 

If  a  backup  fails,  it  does  not  affect  PB1  since  there  will  still  be  a  single  primary. 
However,  when  a  primary  fails,  the  ring  will  have  these  failed  backups  which  do  not 
run  the  ring  protocol  and  the  above  arguments  cannot  be  used  directly.  But,  a  failed 
backup  can  be  equated  to  a  working  backup  that  actually  runs  the  protocol,  but  fails 
immediately.  That  is,  if  it  lies  between  p  and  6,  it  can  be  thought  to  become  the 
primary,  but  fail  just  before  broadcasting  its  I_A.M_ALIVE  messages;  it  it  is  between 
6  and  p,  the  moment  of  its  failure  is  irrelevant.  Considered  this  way,  the  arguments 
used  earlier  for  working  backups  can  be  applied  to  the  situation  with  failed  backups, 
so  PB1  can  be  seen  to  be  valid  even  in  such  cases. 

•  PB2  and  PB3  can  be  upheld  by  specifying  the  client  and  the  backup  servers  to  follow 
them. 

•  Now,  the  worst-case  failure  scenario  is  that  of  a  cascade  of  failures  where  the  primary 
and  all  the  backups  but  the  last  fail  in  ring  order.  Since  there  is  a  message  delay  of  6 
and  a  timeout  of  r  +  6,  we  have  a  failure  time  of  r  +  26  for  each  server  failure,  making 
up  a  total  of /(r  +  26)  or  (n,  -  l)(r  +  26)  net  down  time  for  the  service  before  the 
last  of  the  backups  becomes  the  primary.  Therefore,  we  have  D  =  (n,  -  l)(r  +  26),  a 
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constant.  Consequently,  k  is  also  fixed.  Since  k  and  D  exist  and  are  constants,  PB4 
is  satisfied. 

4.5     Properties 
The  ring  protocol  has  the  following  properties. 

1.  There  is  no  bound  on  the  number  of  backups,  so  resilience  /  is  not  limited  by  the 
protocol.  An  n,  degree  of  replication  will  give  us  a  resilience  of  ft,  —  1,  which  is 
optimal  since  at  least  one  server  needs  to  be  operational  for  any  protocol. 

2.  The  primary  sends  the  STATEJJPDATE  messages  asynchronously,  so  there  is  no 
overhead  delay  due  to  the  protocol1.  So,  this  is  a  non-blocking  protocol  with  a 
blocking  time  rj  =  0,  irrespective  of  the  number  of  backups. 

3.  To  tolerate/  failures,  this  protocol  clearly  requires  /  backups,  so  the  degree  of  replica- 
tion n,  =  /  + 1.  Since  at  least  one  server  needs  to  be  functional  to  provide  service  with 
any  protocol,  we  find  that  the  ring  protocol  requires  a  minimal  number  of  servers. 

4.  As  shown  above,  worst-case  failure  in  service  results  from  a  cascade  of  primary  failures 
and  the  failover  time  tj  =  f(r  +  26).  Note  that  with  one  backup  (/  =  1),  this  is  t  +  26 
as  with  the  Alsberg-Day  protocol. 

Table  4.1  compares  the  ring  protocol  with  Alsberg-Day,  Tandem  and  HA-NFS2.  Note 
that  only  the  ring  protocol  tries  to  achieve  arbitrarily  high  resilience,  so  may  be  seen  as 
really  solving  a  different  problem.  For  this  reason,  the  comparison  is  completely  meaningful 
only  for  the  case  of  ns  =  2.  However,  the  table  does  illustrate  the  fact  that  the  ring  protocol 
can  withstand  a  larger  number  of  failures  than  the  existing  protocols. 


We  neglect  the  time  taken  to  construct  and  dispatch  the  messages 
2For  one  retry;  for  k  retries,  tj=t  +  (fc  +  1)26 


Table  4.1.    Primary-Backup  Protocols 


Protocol 

Failures 

n, 

/ 

TJ 

Tl 

Alsberg-Day 

Crash 

2 

1 

6 

t  +  26 

Tandem 

Crash+Link 

2 

1 

28 

Not  known 

HA-NFS 

Crash+Link 

2 

1 

0 

r  +  4<5 

Ring 

Crash 

n,  >  1 

n„  -  1 

0 

(n,  -l)(r  +  2<5) 
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4.6     Probabilistic  Behavior 

The  failover  delay  of  (ns  —  l)(r  +  26)  represents  the  worst-case  period  of  service  failure, 
which  is  the  interval  during  which  none  of  the  servers  is  the  primary  (predicate  Ps  is  true). 
As  explained,  this  results  from  a  cascade  of  primary  failures,  which  is  very  unlikely  to 
occur  in  real  life.  While  the  failover  delay  places  an  upper  bound  on  the  period  of  service 
failure  and  thereby  provides  us  with  a  service  guarantee,  it  will  be  useful  to  analyze  also 
the  probabilistic  behavior  of  the  system  and  evaluate  the  expected  failure  time  as  opposed 
to  worst-case  time. 

4.6.1     Variables  and  Functions 

Basically,  we  consider  the  failures  of  the  servers  (the  primary  and  the  backups)  and 
the  message  delays  to  be  randomly  distributed.  This  gives  us  the  following  variables  and 
functions.  Here,  PDF  and  CDF  stand  for  probability  density  function  and  cumulative 
density  function,  respectively. 

•  Primary  failures 

-  Range:  0  <  x 

-  PDF:  pft(x) 

-  CDF:Ppf(x)  =  f*oPpf(x)dx 
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•  Backup  failures 

-  Range:  0  <  x 

-  PDF:  pv(x) 

-  CDF:  Pb,(x)  =  S*oPi,(x)  dx 

•  Time  between  sending  I_AM-ALIVE  message  and  primary  failure 

-  Variable:  T(x) 

-  Range:  0  <  x  <  r 

-  PDF:/T(i) 

-  CDF:  FT{x)  =  f'0fr(x)  dx 

-  Mean:  E[T(x))  =  /?*./»(*)  dx 

•  Message  delay 

-  Variable:  D(x) 

-  Range:  0  <  x  <  S 

-  PDF:/S(z) 

-  CDF:  Fs(x)  =  J'0fs(x)  dx 

-  Mean:  E[D{x)]  =  Js0x.f6{x)  dx 

4.6.2     Failover  without  Backup  Failures 

Consider  figure  4.11  representing  a  failover  from  server  1  to  server  2,  the  first  backup 
in  the  ring.  The  actual  times  for  message  delay  and  primary  failure  are  random  variables, 
D(x)  and  T(x)  respectively. 
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T+S  +  D(x)-T(x) 


Figure  4.11.    Probabilistic  View  of  Failover 
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We  note,  from  the  figure,  that  service  is  unavailable  for  an  average  period  F(t)  given 

by 

F(t)    =    E[t  +  6  +  D(x)  -  T(x)] 
*    F(t)    =    t  +  6  +  E[D(x)]  -  E[T(x)] 

If  }s(x)  represents  a  uniform  distribution, 

E[D(x)}  =  Js0x.fs(x)dx=S- 
Similarly,  a  uniform  fr(x)  gives  us 

E[T(x)]  =  jlx.fT(x)dx=T- 
Therefore, 

=>     F{t)     =     ^ 

So,  if  message  delays  and  server  failures  are  both  uniformly  distributed,  we  can  expect 
service  to  be  unavailable  for  an  average  period  of 


*>«=** 


whenever  a  primary  fails. 

Here,  we  take  the  service  to  be  available  as  soon  as  the  backup  (server  2)  becomes  the 
primary,  even  though  it  may  fail  soon  after. 
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4.6.3     Failover  with  Backup  Failures 

If  backups  can  also  fail  (as  they  indeed  can),  the  first  backup  may  not  be  available  to 
take  over  when  the  primary  fails.  For  example,  server  2  may  be  in  a  failed  state  when 
1  crashes,  so  we  need  to  wait  an  additional  time  so  server  3  times  out  and  becomes  the 
primary.  That  is,  we  need  to  modify  the  analysis  above. 

At  any  time  t,  the  probability  that  a  backup  is  in  a  failed  state  is  given  by  Ptf(t).  So, 
when  the  primary  fails,  the  first  backup  takes  over  with  a  probability  of  1  -  Phf(t)  or  passes 
with  a  chance  of  Ptf{t).  Similar  odds  occur  at  the  next  backup.  So,  we  will  have  a  failure 
of  service  for  a  mean  period  of  F(t)  given  by 

F(t)      =       *f«  +  Pt/(t)[{T  +  S)  +  Pbf(t)[(T  +  S)+... 

F(t)    =    if*  +  Py(iXr  +  tf)  +  P%(ttf?+  ()  +  ... 
*.    F{t)    =     *¥*  +  (r  +  «)££,i>3J(*) 
=>    F(t)    =     tfL  +  ir  +  QPytf-^L^. 

So,  if  the  server  failures  and  message  delays  are  uniformly  distributed,  we  have  for  the 
general  case  of  any  of  the  servers  failing, 


r(1).I±H+(,tW„l^> 


Note  that  if  the  backups  do  not  fail  (Pbf(t)  =  0),  this  reduces  to 


n*)'^T— 


which  is  the  result  we  obtained  earlier  for  the  case  of  failover  without  backup  failures. 
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4.7     Usefulness 

We  see  that  the  ring  protocol  satisfies  the  primary-backup  specifications  and  is  correct. 
It  meets  all  the  requirements  (see  3.4)  none  of  the  current  primary-backup  protocols  com- 
pletely satisfies:  arbitrarily  high  resilience,  non-blocking  operation,  independence  of  service 
states  and  a  model  suitable  for  providing  service  through  RPCs.  Its  average-case  analysis 
shows  that  the  protocol  scales  well  with  the  degree  of  replication.  This  protocol  is  therefore 
useful  in  building  reliable  servers  for  DCS. 

In  chapter  5,  we  will  construct  a  queuing  network  for  the  composite  server  consisting  of 
the  primary  and  the  backups,  and  derive  its  equivalent  queuing  model. 


CHAPTER  5 
OPERATIONAL  MODEL 


5.1     Modeling  DCS  Servers 

In  this  chapter,  we  study  the  queuing  models  for  our  DCS  servers.  The  arrival  of  requests 
to  the  servers  and  their  processing  at  these  servers  is  not  deterministic.  Further,  each  DCS 
server  actually  comprises  a  primary  and  a  set  of  backups,  so  is  really  a  network  of  physical 
servers.  We  use  conventional  queuing  theory  for  analyzing  the  composite  servers.  We  treat 
the  logical  (composite)  server  as  a  queuing  network,  consisting  of  individual  physical  servers 
(primary  and  backups)  each  behaving  like  an  independent  service  center. 

5.2     Queuing  Theory  Review 

We  briefly  review  a  few  queuing  theory  principles  relevant  to  our  evaluation  of  DCS 
servers.  To  simplify  the  descriptions,  we  restrict  our  attention  only  to  servers  and  incoming 
and  outgoing  messages,  but  the  terms  are  general  and  apply  to  other  situations  as  well. 

5.2.1     System  Parameters 

•  Arrival  Process 

The  incoming  messages  to  a  server,  in  general,  appear  at  random  intervals.  If  we 
assume  that  these  messages  are  independent  and  identically  distributed  (IID),  the 
arrival  process  is  the  distribution  of  the  interarrival  times  for  the  messages.  This  can 
be  one  of  the  following. 


Exponential  (M) 
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Figure  5.1.    Composite  Server  Architecture 
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-  Erlang  with  parameter  k  (Et) 

—  Hyperexponential  with  parameter  k  (Hk) 

—  Deterministic  (D) 

-  General  (G) 

•  Service  Time  Distribution 

The  time  a  server  takes  to  service  an  incoming  message  is  again  probabilistic.  As 
with  the  distribution  of  interarrival  times  for  incoming  messages,  we  take  the  service 
times  to  be  IID.  Service  time  distribution  also  can  be  one  of  M ,  Et,Ht,  D  and  G. 

•  Number  of  Servers 

This  is,  like  the  name  suggests,  the  number  of  servers  available  to  process  an  incoming 
message.  If  we  have  n,  servers,  it  is  therefore  possible  to  process  up  to  n,  incoming 
messages  concurrently. 

•  System  Capacity 

The  capacity  of  the  system  being  analyzed  affects  our  analysis.  For  servers  in  com- 
puter systems,  this  usually  translates  to  buffer  limitations.  If  the  buffers  are  large 
enough  for  all  expected  arrival  rates,  we  can  take  this  to  be  infinite,  making  it  easier 
to  analyze. 

•  Population  Size 

This  is  the  number  of  messages  in  the  entire  system.  If  this  is  large,  it  will  not  affect 
our  analysis  and  therefore  simplify  it  considerably. 

•  Service  Discipline 

The  order  in  which  incoming  messages  are  processed.  If  there  is  a  constant  delay  for 
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each  message,  the  server  becomes  an  Infinite  Server  (IS)  or  a  delay  center.  The 
processing  can  also  follow  a  scheduling  policy  such  as  First  Come,  First  Served 
(FCFS),  Last  Come,  First  Served  (LCFS),  Round  Robin  (RR)  and  Pro- 
cessor Sharing  (PS).  PS  is  a  special  case  of  RR  with  a  very  small  quantum  and 
negligible  context  switching  overhead.  So,  effectively,  all  the  incoming  messages  are 
simultaneously  processed  and  share  the  server. 

5.2.2  Kendall  Notation 

The  Kendall  notation  is  generally  used  to  denote  a  queuing  system.  With  this  notation, 
A/S/m/B/K/SD  represents  a  system  with  arrival  process  A,  service  time  distribution  5, 
m  servers,  system  capacity  of  B,  population  size  of  K  and  service  discipline  SD.  We  use 
the  Kendall  notation  to  specify  our  DCS  servers. 

5.2.3  Variables 

Variables  normally  used  in  queuing  analysis  are  as  follows.  All  except  A  and  \i  are 
random  variables. 

r  Interarrival  time  of  messages 

A  Mean  arrival  rate  (=  J-i) 

s  Service  time  per  message 

fi  Mean  service  rate  per  server  (=  -gj-t) 

n  Total  messages  in  system,  including  those  being  processed 

nq  Messages  waiting  to  be  processed 

n,  Messages  being  processed 
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r    Response  time,  including  waiting  and  processing  time 

w    Waiting  time 

5.2.4     Poisson  Processes 

Poisson  processes  are  invaluable  in  queuing  theory  because  they  represent  memoryless 
behavior  for  arrivals.  In  addition,  if  the  arrivals  to  a  single  server  with  exponential  service 
time  are  Poisson  with  mean  rate  A,  the  departures  are  also  Poisson  with  the  same  rate  A, 
provided  the  arrival  rate  A  is  less  than  the  service  rate  //.  We  use  these  properties  in  our 
ensuing  analysis. 

5.3     Server  Models 

Each  DCS  server  is  composite  and  is  represented  by  a  primary  and  a  set  of  backups. 
Both  the  primary  and  the  backups  are  single  and  independent  entities.    We  first  model 
these  individual  servers  and  then  build  a  queuing  network  for  the  composite  server. 
5.3.1     Individual  Server 

The  behavior  of  the  primary  is  different  from  the  backups.  We  therefore  model  them 
separately. 

•  Primary 

The  primary  receives  client  requests,  processes  them  and  sends  back  responses.  In 
addition,  for  each  request,  the  primary  passes  state-update  messages1  to  each  of  the 
backups.  The  system  parameters  are  as  follows. 

-  Arrival  Process 

We  take  this  to  be  Poisson.  In  our  DCS  system,  requests  can  arrive  from  clients 


These  are  relays  of  the  request,  in  the  ring  protocol 
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Figure  5.2.     Queuing  Model  for  the  Primary 
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directly  linked  to  the  server  or  from  other  DCS  servers.  Each  of  the  clients  is 
typically  independent  and  the  request  arrivals  cannot  generally  be  predicted. 
Similarly,  the  servers  sending  requests  are  also  independent  and  distributed  in 
different  domains  and  their  requests  are  unpredictable.  Such  memoryless  behav- 
ior is  best  represented  by  Poisson  streams[25]. 

-  Service  Time  Distribution 

We  make  no  assumptions  about  this.  So,  this  distribution  is  taken  to  be  general. 

Number  of  Servers 

The  physical  server  we  are  considering  is  single.  Though  each  request  results  in 

an  independent  thread  of  activity,  the  server  itself  remains  single. 

System  Capacity 

We  assume  this  to  be  unbounded,  to  simplify  analysis.  In  practice,  the  capacity 

is  rarely  exceeded,  so  this  is  a  reasonable  assumption. 

Population  Size 

The  requests  that  can  come  to  a  server  are  infinite,  so  this  is  is  unbounded  as 
well. 

Service  Discipline 

The  primary  server  spawns  a  thread  to  process  a  request,  each  thread  running  at 
the  same  priority.  Since  the  server  time  is  shared  equally  across  the  requests,  this 
can  be  approximated  to  processor  sharing  if  we  neglect  the  overhead  involved 
to  switch  the  contexts. 
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So,  what  we  have  is  an  M /G/1/oo/oo/ PS  system.  An  M/G/l  system  with  processor 
sharing  has  been  shown[27]  to  be  equivalent  to  an  M/M/l  system,  so  a  primary  can 
be  represented  as  M/M/l. 

Figure  5.2  shows  the  modeling  for  a  primary  server2.  We  first  show  the  primary  as 
M/G/l  in  (a),  generating  one  state-update  message  per  backup  for  every  response 
to  client.  The  service  discipline  is  not  specified.  In  (b),  each  request  is  shown  to 
be  processed  by  a  separate  thread,  but  within  the  same  server.  This  represents  a 
processor-sharing  (as  explained  earlier)  service  discipline  and  (c)  shows  the  primary 
to  be  equivalent^  an  M/M/l  system.  We  will  use  the  simplified  model  in  (c)  for  our 
analyses. 

•  Backup 

Unlike  the  primary,  a  backup  has  no  out-going  messages:  it  only  receives  and  processes 
the  state-update  messages  from  the  primary.  Consequently,  its  model  is  different.  The 
system  parameters  are,  however,  similar  and  are  as  follows. 

-  Arrival  Process 

Each  request  to  the  primary  results  in  a  state-update  message  to  the  backups. 
Since  the  request  arrivals  to  the  primary  are  Poisson  streams,  the  arrival  of 
state-update  messages  to  the  backups  will  also  be  Poisson. 

-  Service  Time  Distribution 

The  primary  and  the  backup  are  identical  in  terms  of  their  processing  of  re- 
quests. The  state-update  messages  received  by  the  backups  are  really  relays  of 


See  section  4.2.1  for  the  algorithn 
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the  original  client  requests.  The  service  time  distribution  will  be  identical  to 
that  of  the  primary.  Like  with  the  primary,  it  is  general. 

—  Number  of  Servers 

Like  the  primary,  each  backup  is  a  single  server,  though  each  request  results  in 
an  independent  thread  of  activity. 

—  System  Capacity 

Again,  this  is  taken  to  be  infinite  to  simplify  analysis. 

—  Population  Size 

Since  each  client  request  to  the  primary  results  in  a  state-update  message  to 
each  backup,  the  population  size  is  identical  to  that  of  the  primary.  Since  the 
client  requests  are  taken  to  be  unbounded  for  the  primary,  this  size  is  infinite. 

—  Service  Discipline 

Like  the  primary,  a  backup  spawns  a  thread  to  process  a  request,  each  thread 
running  at  the  same  priority.  So,  like  with  the  primary,  this  can  be  approximated 
to  processor  sharing  if  we  neglect  the  overhead  involved  to  switch  the  contexts. 

So,  each  backup  is  also  represented  by  an  M /G/1/oo/oo/PS  model.  Again,  since 
an  M/G/l  system  with  processor  sharing  is  equivalent  to  an  M/M/l  system,  each 
backup  can  be  modeled  as  M/M/l. 

Figure  5.3  shows  the  modeling  for  a  backup  server3.  As  with  the  modeling  of  the 
primary,  (a)  shows  the  backup  as  an  M/G/l  system,  with  an  unspecified  service 
discipline.  In  (b),  each  request  is  shown  to  be  processed  by  a  separate  thread,  within 


See  section  4.2.2  for  the  algorithm 
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the  same  server.    Finally,  (c)  shows  the  backup  to  be  equivalently  M/M/l  as  the 
service  discipline  is  processor-sharing. 

5.3.2     Composite  Server 

Using  the  M/M/l  models  for  the  primary  and  the  backups,  and  the  architecture  of  the 
composite  server  in  figure  5.1,  we  get  the  queuing  network  of  figure  5.4. 
We  make  the  following  observations  about  the  composite  server  network. 

•  As  shown  earlier,  the  primary  queuing  system  can  be  treated  as  M/M/l.  The  incom- 
ing request  messages  constitute  a  Poisson  stream  and  the  service  discipline  can  be 
taken  to  be  exponential  (though  in  reality  we  made  no  assumptions  and  considered 
it  as  general). 

•  As  per  the  ring  protocol,  the  primary  sends  a  state-update  message  to  each  of  the 
backups  exactly  once  for  each  response  to  client  and  at  precisely  the  same  time.  So, 
each  of  the  state-update  message  streams  follow  the  same  distribution  as  the  outgoing 
response  stream.  If  the  the  incoming  request  stream  has  a  mean  rate  of  A,  we  can 
conclude  (using  the  property  in  section  5.2.4)  that  not  only  will  the  response  stream 
be  Poisson  with  mean  rate  A,  but  each  of  the  state-update  messages  streams  to  the 
backups  will  also  be  Poisson  with  the  same  mean  rate  A.  Of  course,  we  assume  the 
basic  stability  condition  that  the  arrival  rate  A  is  less  than  the  service  rate  /i. 

5.4     Using  the  Model 

The  model  in  figure  5.4  represents  a  DCS  (composite)  server  and  can  be  used  in  analyz- 
ing complex  systems  using  such  servers.  As  explained  in  chapter  1  and  shown  in  figure  1.1, 
DCS  consists  of  a  number  of  domains  each  containing  one  logical  (composite)  server.  So, 
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Figure  5.4.     Queuing  Model  for  Composite  Server 
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we  can  use  the  queuing  model  for  the  composite  server  in  analyzing  the  queuing  network 
for  the  DCS  setup.  We  do  this  in  chapter  6. 


CHAPTER  6 
DCS  WITH  RESILIENT  AND  RESPONSIVE  SERVERS 

6.1     DCS  with  Composite  Servers 

As  described  in  chapter  1,  the  DCS  environment  consists  of  multiple  LANs  intercon- 
nected by  a  WAN  (see  figure  1.1).  Each  LAN  contains  one  (logical)  DCS  server1. 

To  incorporate  our  resilient  and  responsive  servers  into  the  DCS  setup,  we  use  the 
composite  server  consisting  of  a  primary  and  backups  to  act  as  the  logical  DCS  server. 
Figure  6.1  shows  an  example  configuration  with  two  LANs  and  composite  servers  employing 
four  backup  servers  each. 

6.2     Need  for  Additional  Components 

Figure  6.2  shows  an  expanded  view  of  the  activities  within  a  LAN.  There  are  two 
problems  that  we  should  handle: 


1.  For  a  primary  to  make  a  request  to  another  primary,  it  must  have  the  following 
information: 

•  The  addresses  of  all  the  servers  (the  primary  and  the  backups)  in  the  remote 
LAN. 

•  The  identity  of  the  current  primary. 

The  primary  should  then  direct  requests  only  to  the  current  primary. 


For  more  details  on  the  DCS  architecture,  see  appendix  B 
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Figure  6.2.    Activities  in  Domain 
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2.  Incoming  requests  may  not  be  causally  ordered.  The  primary  must  reorder  these 
messages  appropriately. 

Figure  6.3  shows  additional  components  for  the  primary  to  address  these  two  problems. 

The  clerk[49]  essentially  hides  the  composite  nature  of  the  remote  server  and  automati- 
cally addresses  requests  to  the  remote  primary.  Some  of  the  preferred  ways  in  which  clerks 
can  be  employed  are  depicted  in  figure  6.4  and  described  below. 

1.  The  clerk  can  be  a  distinct  process  communicating  with  the  client  through  one  of 
the  IPC  mechanisms.  This  scheme  is  flexible,  but  inefficient  and  inelegant.  It  also 
requires  another  process. 

2.  We  can  make  the  clerk  part  of  the  client.  It  can  be  linked  statically  or  made  part  of  a 
shared  library  and  linked  at  run  time.  One  major  benefit  of  using  dynamic  linking  is 
that  it  becomes  possible  to  use  existing  dynamically  linked  software  designed  to  work 
with  regular  servers. 

3.  We  can  make  clerk  code  part  of  the  kernel,  but  this  requires  modification  of  the  kernel 
which  is  generally  undesirable  or  not  possible. 

We  have  chosen  to  employ  the  clerk  as  a  statically  linked  library. 

The  causal  delivery  unit  reorders  incoming  requests  (from  remote  primaries,  not  from 
local  clients)  to  preserve  causality. 

6.3     Queuing  Model 

Using  our  model  for  the  primary  developed  in  chapter  5,  we  derive  the  queuing  network 
represented  in  figure  6.5.  Though  the  clerk  component  currently  exists,  we  do  not  know  its 
queuing  behavior;  the  causal  delivery  unit  is  yet  to  be  developed. 
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6.4     Resilient  and  Responsive  Service 

We  see  that  by  employing  clerks,  we  can  use  our  composite  servers  in  the  place  of 
regular  servers.  This  makes  the  servers  resilient  and  responsive,  which  was  our  primary 
goal2.  We  are  now  in  a  position  to  implement  causal  multicasts  for  DCS  and  the  resilience 
and  responsiveness  of  the  servers  makes  real-time  operation  of  DCS  possible. 

Note  that  though  composite  servers  have  been  developed  with  DCS  in  mind,  they  can 
be  used  for  any  distributed  application  designed  to  run  on  a  similar  environment. 


2See  chapter  1 


CHAPTER  7 
CONCLUSION  AND  FUTURE  WORK 

7.1     Conclusion 

We  have  seen  that  DCS  requires  resilient  and  responsive  servers  to  support  its  causal 
multicasting  and  real-time  operation.  Loosely  speaking,  resilience  of  a  service  is  the  number 
of  failures  it  can  withstand;  responsiveness  is  how  rapidly  it  can  provide  the  service. 

After  studying  the  operation  and  failure  models  for  the  DCS  environment,  we  discovered 
that  a  server  that  operates  in  a  synchronous  environment  and  experiences  only  crash  failures 
can  meet  our  needs.  Since  failed  servers  will  be  recovered  by  humans,  we  found  that  multiple 
backups  will  be  necessary  to  reduce  the  chance  of  all  the  servers  failing  simultaneously. 
Further,  it  is  desirable  for  the  servers  to  be  non-blocking,  and  to  operate  with  a  minimal 
failover  delay  for  efficient  operation. 

We  found  that  none  of  the  existing  primary-backup  protocols  entirely  supports  the 
following  required  properties: 

1.  Arbitrarily  high  resilience.  DCS  absolutely  requires  this  to  allow  for  multiple  server 
failures. 

2.  Non-blocking  operation.    We  need  highly  responsive  servers  for  DCS  to  support  its 
real- time  operations. 

3.  Independence  of  service  states.  The  servers  are  physically  separated  and  fail  indepen- 
dently. 
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Table  7.1.     Existing  Primary-Backup  Protocols 


Protocol 

Failures 

»., 

/ 

T\ 

Tl 

Alsberg-Day 

Crash 

2 

1 

6 

r  +  26 

Tandem 

Crash+Link 

2 

1 

26 

Not  known 

HA-NFS 

Crash+Link 

2 

1 

0 

r  +  4,5 

4.  Model  suitable  for  providing  service  through  RPCs.    This  is  not  essential,  but  is 
desirable. 

Specifically,  table  7.1  shows  their  parameters. 

We  then  developed  our  ring  protocol  which  satisfies  the  primary-backup  specifications 
and  showed  that  it  is  correct.  We  found  that  it  meets  all  the  requirements  that  none 
of  the  current  primary-backup  protocols  completely  satisfies:  arbitrarily  high  resilience, 
non-blocking  operation,  independence  of  service  states  and  a  model  suitable  for  providing 
service  through  RPCs.  Its  average-case  analysis  also  showed  that  the  protocol  scales  well 
with  the  degree  of  replication.  We  concluded  that  this  protocol  is  useful  in  building  reliable 
servers  for  DCS. 

We  then  modeled  our  composite  server  consisting  of  a  primary  and  a  set  of  backups  and 
simplified  it  so  it  can  be  used  in  analyzing  complex  systems  using  such  servers.  Since  DCS 
consists  of  a  number  of  domains  each  containing  one  logical  (composite)  server,  we  can  use 
the  queuing  model  for  the  composite  server  in  analyzing  the  queuing  network  for  the  DCS 
setup. 

We  found  that  by  employing  clerks,  we  can  use  our  composite  servers  in  the  place  of 
regular  servers,  making  the  servers  resilient  and  responsive,  which  was  our  primary  goal. 
This  makes  it  possible  to  implement  causal  multicasts  for  DCS  and  the  resilience  and 
responsiveness  of  the  servers  makes  real-time  operation  of  DCS  possible. 


74 

Though  composite  servers  have  been  developed  with  DCS  in  mind,  we  found  that  they 
can  be  used  for  any  distributed  application  designed  to  run  on  a  similar  environment.  This 
makes  our  composite  servers  widely  applicable. 

Our  setup  of  the  primary  and  the  backups  within  individual  LANs  serving  clients  across 
a  WAN  is  a  special  case  (B)  of  the  possibilities  illustrated  by  table  7.2. 

Table  7.2.    Possible  Separation  of  Primary  with  Backup(s)  and  Client 


Case 

Primary-Backup(s) 

Primary-Client 

A 

LAN 

LAN 

B 

LAN 

WAN 

C 

WAN 

LAN 

D 

WAN 

WAN 

We  have  assumed1  that  the  primary  and  the  backups  for  each  composite  server  share 
the  same  LAN,  which  we  modeled  as  a  synchronous  system  suffering  only  crash  failures. 
We  however  let  the  primary  and  the  clients  be  separated  by  a  WAN,  treating  it  as  an 
asynchronous  system  exhibiting  general-omission  failures. 

Let  us  now  consider  the  different  setups: 

A.  Here,  the  primary  and  the  backups  are  all  within  one  LAN,  so  satisfy  our  assumption. 
Any  protocol  that  can  operate  in  an  asynchronous  system  with  general-omission  fail- 
ures will  function  in  a  synchronous  system  with  crash  failures.  Our  assumption  that 
the  primary  and  the  clients  are  separated  by  a  WAN  is  broad  enough  to  include  the 
case  of  them  being  separated  by  a  LAN.  The  ring  protocol  is  therefore  applicable  in 
this  setup. 


See  chapters  1  and  2  for  complete  set  of  assumptions 
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B.  This  is  precisely  the  arrangement  we  assumed  in  this  dissertation  and  we  have  already 
shown  in  chapter  4  that  the  ring  protocol  is  correct  in  such  an  environment. 

C.  As  per  our  modeling  of  a  WAN,  the  primary  and  the  backups  will  operate  in  an 
asynchronous  environment  with  general-omission  failures  if  they  are  separated  by  a 
WAN.  The  ring  protocol  cannot  operate  correctly  in  such  a  setup  since  it  uses  both 
the  synchronous  behavior  (which  limits  network  delays)  and  the  restriction  to  crash 
failures  (where  messages  are  not  dropped)  heavily  in  its  functioning.  The  ring  protocol 
therefore  cannot  be  used  in  such  a  setup. 

D.  For  the  same  reasons  in  C,  the  ring  protocol  is  inapplicable  in  this  arrangement  as 
well. 

7.2     Future  Work 

The  following  work  remains  to  be  done. 

1.  The  crash  failure  model  for  the  LAN  is  not  entirely  realistic  since  communication 
failures  are  possible  and  the  messages  between  the  primary  and  the  backups  may  be 
dropped.  We  need  to  simulate  crash  failures  using  mechanisms  such  as  recoverable 
queues[9]. 

Using  connection-oriented  links  (such  as  TCP/IP)  does  not  help  since  the  queued 
messages  will  be  lost  when  the  sender  crashes,  and  if  we  avoid  a  queue  by  using  a 
stop-and-wait  approach2,  the  protocol  will  become  blocking.  Using  primitives  such 
as  Amoeba  SendToGroup[26]  also  does  not  eliminate  the  problem. 
It  is  possible  to  increase  the  fault-tolerance  of  an  algorithm  automatically  by  trans- 
lating it  to  one  that  tolerates  a  more  severe  failure.    Many  such  translations  have 


where  the  sender  waits  for  in  acknowledgment  from  the  receiver  before  sending  the  next  message 
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been  proposed[14,  18,  51,  36,  7,  8].  A  prediction  of  the  failures  in  future  systems[4] 
suggests  that  a  receive-omission  failure  model  will  be  appropriate,  so  it  may  be  useful 
to  extend  the  ring  protocol  to  tolerate  receive-omission  failures. 

2.  There  is  no  mechanism  currently  to  alter  the  ring  structure:  each  server  has  a  perma- 
nently assigned  id  and  location  in  the  ring.  We  may  be  able  to  add  and  drop  servers 
from  the  ring  with  the  following  actions. 

(a)  Give  the  new  ring  structure  first  to  the  primary,  and  then  to  the  new  backups 
in  the  new  ring  order.  This  needs  to  be  analyzed  for  correctness  and  effect  on 
the  guarantees  of  the  ring  protocol. 

(b)  Inform  all  the  clients  about  the  new  members  in  the  ring  (we  do  not  need  to  give 
them  its  structure). 

(c)  Inform  the  name  servers  about  the  new  ring  members  so  future  resolutions  for 
the  domain  name  will  yield  the  list  of  new  members. 

3.  It  may  sometimes  be  useful  to  allow  backups  to  respond  directly  to  client  requests  (not 
for  load-sharing,  because  the  primary  and  the  backups  all  process  precisely  the  same 
set  of  requests  in  the  ring  protocol).  However,  the  primary-backup  specification3 
prohibits  backups  from  responding  (PB3)  and  our  ring  protocol  will  not  function 
correctly  if  the  states  of  the  backups  change  independent  of  the  primary.  One  way 
to  overcome  this  limitation  is  to  recognize  client  requests  that  will  not  change  server 
state  (such  as  pure  reads)  and  allow  processing  of  these  by  the  backups  as  well.  Since 
the  backups  lag  the  primary  in  state  transitions,  the  responses  from  the  backups  may 
be  stale,  but  this  is  often  acceptable. 


See  section  3.3 
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Recognizing  requests  that  do  not  change  states  will  also  allow  the  primary  to  optimize 
by  not  passing  these  as  state-updates  to  the  backups.  We  can  then  run  different 
requests  independently  on  different  servers  and  achieve  a  considerable  performance 
gain. 

4.  We  need  to  recover  failed  servers  without  violating  the  guarantees  of  the  ring  protocol. 
The  basic  problem  is  to  synchronize  the  state  of  the  recovering  server  (which  comes 
up  as  a  backup)  with  that  of  the  primary. 

(a)  One  obvious  way  is  to  use  the  quiet-point  approach  of  database  systems  and  wait 
for  a  moment  when  the  primary  has  no  pending  client  requests.  It  can  then  pass 
its  state  to  the  backup  and  continue  normally,  with  the  new  backup  included  in 
its  broadcasts. 

In  practice,  however,  it  may  be  considerably  long  before  the  primary  experiences 
a  quiet-point.  Moreover,  requests  will  not  be  processed  till  the  entire  state  is 
transferred  to  the  recovering  backup  and  this  violates  our  blocking  time  (rj  =  0) 
and  failover  time  (77  =  f{r  +  26))  guarantees. 

(b)  A  better  way  is  to  use  follow  a  check-point  based  approach  and  update  the  state 
gradually.  For  example,  the  primary  can  transfer  its  last  check-pointed  state  to 
the  backup  without  halting  its  processing  of  client  requests.  Once  the  state  has 
been  transferred,  it  can  pass  all  the  client  requests  received  since  that  check- 
point. This  way,  there  will  be  no  break  in  service,  but  the  primary  will  need  to 
queue  requests  between  check-points.  We  need  to  analyze  the  impact  of  failures 
on  the  contents  of  this  queue,  and  possibly  modify  the  current  requirement  that 
backups  cache  only  the  last  state-update  message. 
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5.  Clerks  should  be  dynamically  linked  and  the  protocol  tested  with  existing  services. 
This  will  expand  the  scope  of  our  work  considerably. 

6.  Multicasting  has  to  be  made  causal.  Currently,  the  causal  delivery  unit  (CDU)  de- 
picted in  figure  6.3  does  not  exist,  so  multicasts  do  not  preserve  causality. 

7.  Models  for  the  clerk  and  CDU  have  to  be  built  to  analyze  the  behavior  of  the  servers. 
For  example,  the  CDU  will  reorder  the  incoming  requests,  so  the  queuing  process  is 
complicated. 

8.  Primaries  do  not  currently  discriminate  between  requests  from  local  clients  and  those 
from  other  primaries.  This  should  be  handled  when  the  CDU  is  developed. 

7.3     Last  Word 

It  is  clear  from  these  discussions  that  while  the  ring  protocol  is  invaluable  in  achieving 
arbitrary  resilience  in  synchronous  systems,  it  has  to  be  strengthened  in  many  areas  for  it 
to  be  widely  applicable. 


APPENDIX  A 
HOST  FAILURE  STATISTICS 

The  following  statistics  have  been  derived  from  the  wtmp(5V)  files  on  the  hosts. 
Table  A.l.     Host  Failure  Statistics 


|  HOST 

DURATION 

^failures 

MTBF   | 

azalea 

1050d 

92 

lid  9h 

beach 

674d 

79 

8d  12h 

buoy 

1056d 

88 

12d  Oh 

camellia 

1056d 

118 

8d22h 

coconut 

1066d 

152 

7d  Oh 

coppertone 

964d 

203 

4d  18h 

crane 

1071d 

146 

7d  8h 

cutter 

1070d 

187 

5d  17h 

dragonfly 

1069d 

186 

5d  17h 

glade 

1056d 

101 

lOd  lOh 

insect 

716d 

125 

5d  17h 

jetty 

1054d 

154 

6d20h 

ketch 

1070d 

222 

4d  19h 

minnow 

1041d 

125 

8d  7h 

moccasin 

677d 

34 

19d  22h 

native 

1065d 

220 

4d20h 

quicksand 

1071d 

166 

6d  lOh 

rattler 

677d 

22 

30d  19h 

rock 

1061d 

206 

5d3h 

sabal 

1056d 

126 

8d9h 

sandal 

1048d 

57 

18d  9h 

schooner 

1056d 

169 

6d  6h 

sloop 

1068d 

209 

5d  2h 

sunburn 

530d 

14 

37d  21h 

thunder 

847d 

199 

4d  6h 

urchin 

1065d 

126 

8d  lOh 

yacht 

1063d 

132 

8d  lh 

yawl 

134d 

16 

8d  9h 

zinnia 

1053d 

111 

9d  llh 
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APPENDIX  B 
DCS  TOP-LEVEL  ARCHITECTURE 


A  collection  of  DCS  servers  invoked  for  a  common  purpose,  along  with  the  associated 
resources  represents  a  DCS  instance.  It  is  possible  for  several  instances  to  coexist  inde- 
pendent of  each  other,  each  with  its  own  set  of  servers  and  resources.  It  is  currently  not 
required  for  one  instance  to  be  able  to  communicate  with  another. 

A  DCS  domain  or  site  is  a  LAN  of  machines  typically  sharing  the  same  Internet  (ad- 
ministrative) domain.  A  DCS  domain  is  characterized  by  both  physical  and  administrative 
proximity  of  its  processors,  allowing  direct  control  of  one  by  another.  Each  DCS  site  sup- 
ports only  one  DCS  server. 

A  typical  instance  of  DCS  is  depicted  in  figure  B.l. 
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Figure  B.l.    Example  Top-level  DCS  Architecture 


APPENDIX  C 
DCS  SERVICES 

Each  DCS  server  supports  the  following  services.  Details  about  actual  functions  in  the 
implementation  are  in  appendix  D. 

1.  Access  Control  Service  (ACS)  provides  security  of  access  to  objects  within  the 
server. 

2.  Application  and  File  Service  (APP)  enables  use  of  applications  from  within 
DCS.  Importing,  installing,  registering,  invoking  and  exporting  of  applications  are  all 
handled  by  this  service. 

3.  Conference  Service  (CNF).  Manages  conferences1  and  their  objects.  CNF  handles 
all  operations  related  to  conferences,  such  as  joining  a  conference  and  adding  a  user 
to  it. 

4.  Control  Service  (CTL)  enables  base  control  of  the  DCS  system.  All  low-level 
activity  including  communication  between  servers  and  threading  is  handled  here. 

5.  Database  Service  (DBS)  provides  databases  for  all  other  services.  Services  manage 
their  persistent  data  only  through  this  service. 

6.  Decision  Support  Service  (DSS)  supplies  a  variety  of  decision  mechanisms.  De- 
cisions may  be  made  by  humans  or  software  entities. 


See  chapter  1  for  description  of  conference 
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Figure  C.l.    Services  of  a  DCS  Server 
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7.  Global  Directory  Service  (GDS)  provides  directory  service  for  the  entire  DCS 
instance,  relating  to  relatively  permanent  information  such  as  location  of  DCS  servers 
and  user  addresses.  Unlike  other  services,  GDS  is  active  in  only  a  few  of  the  servers. 

8.  Notification  Service  (NTF)  contains  notification  mechanisms  for  use  both  within 
and  across  servers.  A  number  of  options  for  aspects  such  as  method  of  notification 
and  reporting  period,  are  available. 

9.  Secure  Communication  Service  (SCS)  enables  authenticity,  integrity  and  confi- 
dentiality in  communication  with  DCS  servers. 

Figure  C.l  shows  the  layering  of  a  DCS  server.  Though  this  represents  most  interactions 
between  services,  there  may  be  some  that  do  not  follow  this  structure. 


APPENDIX  D 
DCS  CONTROL  SERVICE  FUNCTIONS 

The  following  is  the  complete  list  of  functions  that  implement  the  DCS  operations 
covered  in  this  dissertation.  All  of  them  belong  to  the  CTL  service. 

•  General  Functions 

-  ctlJnit.dcs  :  Initialize  all  DCS  services. 

-  ctlJnit.client  :  Initialize  all  DCS  services  for  a  client. 

-  ctlJnit  :  Initialize  Control  service. 

-  ctLcmddir  :  Get  directory  containing  executable. 

-  ctLgethostbyname  :  Resolve  a  host  or  DCS  site  name. 

-  ctLgetdomainid  :  Return  id  of  local  DCS  domain. 

-  ctLgetdomainname  :  Get  name  of  local  DCS  domain. 

-  ctl_hostJink  :  Set  up  link  to  a  host. 

-  ctl_sync  :  Sync  CTL  state  to  stable  storage. 

-  ctl_xdrproc  :  XDR  filter  for  all  DCS  marshalling. 

-  ctl_status  :  Return  status  of  Control  service. 

-  ctLexit.dcs  :  Terminate  DCS  process  cleanly. 

•  Ring  Functions 

-  ctljingJnit  :  Initialize  CTL  ring  protocol. 

85 


-  ctl_ring_primary  :  Find  id  of  primary  server. 

-  ctl_ring-pulse  :  Send  primary  pulse  to  secondaries. 

-  ctl_ring_process  :  Process  an  RPC  request. 

-  ctl_ring_bcast  :  Broadcast  a  message  to  other  servers  in  ring. 

•  Group  Functions 

-  ctl_group_create  :  Create  group  of  site  links. 

-  ctl.group.add  :  Add  a  site  link  to  group. 

-  ctl_group_delete  :  Delete  a  site  link  from  group. 

-  ctl_group_members  :  Return  current  links  in  group. 

-  ctl-group-destroy  :  Destroy  group  and  all  its  links. 

-  ctLgroup.bcast  :  Broadcast  to  sites  connected  by  group  links. 

•  Site  Functions 

-  ctl_siteJink  :  Set  up  link  to  a  site. 

-  ctl_site_unlink  :  Teardown  link  to  a  site. 

-  ctl_site.call  :  Make  an  RPC  to  a  site. 

•  Thread  Functions 

-  ctLthread init  :  Initialize  thread  library 

-  ctl_thread_create  :  Create  a  thread. 

-  ctl_thread_self  :  Get  identity  of  calling  thread. 

-  ctLthread.exit  :  Exit  of  a  thread. 
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-  ctl_thread_signal  :  Arrange  to  catch  a  signal  sent  to  thread. 

-  ctLthread-kill  :  Send  specified  signal  to  thread. 

-  ctl_mutexjnit  :  Prepare  a  mutex  variable. 

-  ctl_mutex_lock  :  Blocking  mutex  lock  (like  semaphore  down). 

-  ctl-mutex-trylock  :  Non-blocking  mutex  lock. 

-  ctl_mutex_unlock  :  Mutex  unlock  (like  semaphore  up). 

-  ctl-mutex.destroy  :  Destroy  the  mutex  variable. 

•  Error  Functions 

-  ctl_error_perror  :  Print  a  diagnostic  error  message. 

-  ctl_error_message  :  Build  a  diagnostic  error  message. 

-  ctLerrorjiotify  :  Notify  DCS  administrator  about  an  error. 

We  now  give  the  complete  man  pages  for  some  of  these  functions. 


CTL_INIT_DCS(3)  DCS  Functions  CTL.INIT.DCSC3) 

NAME 

ctl_init_dcs  -  initialize  DCS 

SYNOPSIS 

int  ctl_init_dcs(argv) 
char  *argv[]  ; 

♦include  <dcs.h> 

DESCRIPTION 

ctl_init_dcs()  intializes  all  DCS  services.  It  in  turn  calls 

ctl_init(3),  dbs.init(3),    cnf_init(3),    acs.init(3), 

scs.init(3),  dss_init(3),  ntf.init(3),  app_init(3),   and 

gds_init(3)  in  that  order.  ctl_init_dcs()  should  be  called 

before  any  other  DCS  function  is  called  or  variables 
accessed. 

Normally,  options  specific  to  a  DCS  service  are  specified  on 
the  command-line  and  passed  through  argv. 

SEE  ALSO 

acs_init(3),  app_init(3),  cnf_init(3),  ctl_error_perror(3) , 
ctl_init(3),  dbs.init(3),  dss_init(3),  gds.init(3), 
ntf.init(3),  scs_init(3) 

RETURN  VALUES 

ctl.init.dcsO  returns  OK  on  success  and  an  error  value  on 
failure.  ctl_error_perror(3)  can  be  used  to  print  a  diag- 
nostic message  using  this  error  value. 

BUGS 

Leaving  out  argv  while  calling  ctl.init.dcsO  can  cause  a 
memory  fault. 

DCS  Release  2. Obi  Last  change:  11  December  199S  1 
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CTL.INIT(3)  DCS  Functions  CTL.IHITO) 

NAME 

ctl.init  -  initialize  DCS  CTL  service 

SYNOPSIS 

int  ctl_init(argv) 
char  *argv[]  ; 

#include  <dcs.h> 

DESCRIPTION 

ctl.initO  intializes  the  DCS  CTL  service.  No  CTL  function 
should  be  called  before  the  service  is  initialized.  Nor- 
mally, ctl_init_dcs(3)  calls  ctl.initO. 

Options  specific  to  the  CTL  service  can  be  specified  on  the 
command-line  and  passed  through  argv. 

SEE  ALSO 

ctl_error_perror(3),  ctl_init_dcs(3) ,  ctl_status(3) 

RETURN  VALUES 

OK  Initialization  successful. 

ECTLDCSPATH  DCS  system  root  could  not  be  found.  This  usu- 
ally happens  if  the  executable  is  not  in  the 
bin/  directory  of  the  DCS  system  root . 

ERRORS 

ctl.error.perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  this  error  value. 

BUGS 

Leaving  out  argv  while  calling  ctl.initO  can  cause  a  memory 
fault . 

DCS  Release  2. Obi  Last  change:  31  October  1995  1 
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CTL.GETHOSTBYNAHEO) 
NAME 


DCS  Functions 


CTL.GETHOSTBYNAHEO) 


ctl.gethostbyname  -  resolve  host  or  DCS  site  name 

SYNOPSIS 

int  ctl_gethostbyname(name) 
char  *name; 

# include  <netdb.h> 
#include  <dcs.h> 

DESCRIPTION 

ctl.gethostbynameO  returns  a  pointer  to  the  following 
structure  containing  data  from  either  DCSROOT/etc/hosts  or 
/etc/hosts. 


struct  hostent 
{ 

char  *h_name; 

char  **h_ aliases; 

int  h_addrtype; 

int  h.length; 

char  **h_addr  list; 

}; 


/*  Official  name  */ 
/*  Alias  list  */ 
/*  Address  type  */ 
/*  Length  of  address  */ 
/*  List  of  addresses  */ 


The  members  of  this  structure  are  as  follows. 


h_name 


Official  name  of  the  host  or  site. 
If  it  is  a  site,  this  is  same  as 
name. 


h.aliases  A  NULL  terminated  array  of  alter- 

nate names  for  name. 

h_addrtype  The  type  of  address  being  returned, 
currently  always  AF.INET. 

h.length  The  length  of  address  in  bytes. 

h.addr.list  A  pointer  to  a  NULL  terminated 
array  of  network  addresses  for 
name.  If  name  is  a  DCS  site,  the 
addresses  represent  the  hosts  run- 
ning DCS  servers  at  the  site;  oth- 
erwise, they  correspond  to  name 
itself.  All  addresses  are  returned 
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in  network  byte  order. 

ctl.gethostbynameO  first  looks  up  DCSROOT/etc/hosts  for 
site  or  host  information  on  name.  If  not  found,  it  invokes 
gethostbynameON) . 

DCS  Release  2. Obi  Last  change:  8  November  1995  1 

CTL.GETHOSTBYNAMEO)      DCS  Functions      CTL.GETHOSTBYNAMEO) 

FILES 

DCSROOT/etc/hosts   DCS  hosts  file 

/etc/hosts         System  hosts  file 

SEE  ALSO 

ctl.error.perrorO) ,   ctl.site.linkO) ,  gethostbynameON) 

RETURN  VALUES 

ctl.gethostbynameO  returns  a  pointer  to  struct  hostent 
above  on  success  and  NULL  on  failure. 

BUGS 

For  consistency  with  gethostbynameON),  ctl.gethostbynameO 
returns  only  NULL  to  indicate  an  error.  So,  unlike  most 
other  DCS  functions,  ctl.error.perrorO)  cannot  be  used  for 
diagnosis. 

DCS  Release  2. Obi  Last  change:  8  November  199S  2 
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CTL.SYNCC3)  DCS  Functions  CTL_SYNC(3) 

NAME 

ctl.sync  -  sync  DCS  CTL  service  state 

SYNOPSIS 

void  ctl_sync() 

•include  <dcs.h> 

DESCRIPTION 

ctl.syncO  saves  the  state  of  CTL  service,  in  stable 
storage.  This  excludes  its  databases  since  the  DBS  service 
saves  all  databases  during  dbs_sync(3). 

CTL  service  calls  ctl_sync(3) ,  dbs_sync(3) ,  cnf_sync(3), 
acs_sync(3),  scs_sync(3) ,  dss.sync(3) ,  ntf_sync(3), 
app_sync(3),  and  gds_sync(3)  in  that  order  every  SYNCMON- 
TICKS  *  HONTICKTIME  seconds  to  checkpoint  the  service 
states. 

SEE  ALSO 

acs.sync(3),  app_sync(3),    cnf _sync(3) ,    ctl.sync(3), 

dbs_sync(3),  dss_sync(3),  gds_sync(3),  ntf _sync(3) , 
scs_sync(3) 

BUGS 

The  time  between  calls  to  ctl_sync()  may  exceed  the  normal 
period. 
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NAME 

ctl.status  -  give  current  status  of  the  CTL  service  of  DCS 

SYNOPSIS 

int  ctl.status (status,  status.size) 
char  *status; 
size_t  status_size; 

#include  <dcs.h> 

DESCRIPTION 

ctl.statusQ  returns  the  current  status  of  the  CTL  service. 
If  status  is  not  a  NULL  pointer  and  status.size  is  a  posi- 
tive integer,  a  character  string  representing  the  current 
CTL  service  status  and  not  exceeding  status.size  in  length, 
is  written  to  status. 

ctl_status()  is  generally  useful  for  fault  diagnosis. 

SEE  ALSO 

ctl_error_perror(3) 

RETURN  VALUES 

OK  On  success. 

ECTLINVAL      If  status  is  NULL  or  status.size  is  not  posi- 
tive. 

ERRORS 

Diagnostic  message  can  be  printed  with  ctl.error_perror(3) . 

BUGS 

status  must  have  at  least  status.size  bytes  space  allocated. 
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NAME 

ctl_exit_dcs  -  terminate  a  DCS  process  cleanly 

SYNOPSIS 

void  ctl_exit_dcs (status) 
int  status 

((include  <dcs.h> 

DESCRIPTION 

ctl_exit_dcs()  enables  a  DCS  client  or  server  process  to 
make  a  clean  exit.  Most  importantly,  it  saves  all  service 
states  to  stable  storage  so  that  they  become  permanent 
before  the  process  exits.  To  terminate,  it  calls  exit (2) 
with  status. 

ctl_exit_dcs()  should  be  called  in  lieu  of  exit (2)  to  ensure 
a  safe  termination. 

SEE  ALSO 

acs.syncO),  app_sync(3),  cnf  _sync(3) ,  ctl_sync(3), 
dbs_sync(3),  dss.sync(3),  gds_sync(3),  ntf_sync(3), 
scs_sync(3) 
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NAME 
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CTL_GR0UP_CREATE(3) 


ctl_group_create,  ctl_group_add,  ctl_group_delete, 
ctl_group_members ,  ctl_group_destroy  -  handle  group  of  links 
to  DCS  servers 

SYNOPSIS 

int  ctl_group_create() 

int  ctl_group_add(group,  link) 
int  group ; 
int  link; 

int  ctl_group_delete (group,  link) 
int  group; 
int  link; 

int  ctl_group_members (group,  list,  list.size) 
int  group; 
int  list  []  ; 
int  list_size; 

int  ctl_group_destroy(group) 
int  group; 

#include  <dcs.h> 

DESCRIPTION 

ctl.group.createO  creates  a  group  of  links  to  DCS  servers. 
When  created,  a  group  will  not  contain  any  links,  but  they 
can  be  added  with  ctl_group_add()  and  deleted  with 
ctl_group_delete()  at  any  time,  ctl.group .members ()  gives 
the  current  links  of  the  group  and  ctl_group_destroy()  elim- 
inates the  group. 

ctl_group_members()  writes  the  current  link  numbers  in  array 
list,  up  to  a  maximum  of  list.size. 

The  primary  function  of  a  link  group  is  to  make  replicated 
Remote  Procedure  Call  (RPC)  to  the  DCS  servers  connected  by 
the  links  of  the  group.  E.g.,  all  the  servers  at  sites  par- 
ticipating in  a  conference  can  be  reached  through  the  links 
of  a  group  for  the  conference.  ctl.group_bcast(3)  should  be 
used  to  make  the  RPC. 


SEE  ALSO 


ctl_error_perror(3) ,  ctl_group_bcast(3) 

RETURN  VALUES 

ctl_group_create()  returns  an  integer  representing  the  new 
group  or  ECTLKDGROUPS  if  no  more  groups  can  be  created. 
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ctl_group_add()  returns  OK  if  link  can  be  added  or  one  of 
the  following  errors  on  failure. 

ECTLBADGROUP   Bad  group. 

ECTLNOMEM      Out  of  memory. 

ECTLDUPLINK    link  already  in  group. 

ctl_group_delete()  returns  OK  on  success  or  one  of  the  fol- 
lowing errors  otherwise. 

ECTLBADGROUP   Bad  group. 

ECTLNOLINK     link  not  in  group. 

ctl_group_members  O  returns  the  number  of  links  written  to 
list  or  an  error  as  follows. 

ECTLBADGROUP   Bad  group. 

ECTLINVAL      Bad  list  or  list. size. 

ctl_group_destroy()  returns  OK  if  group  can  be  removed  or 
ECTLBADGROUP  if  group  is  bad. 

ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 
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NAME 

ctl.group.bcast  -  make  broadcast  RPC  to  DCS  servers 

SYNOPSIS 

int  ctl_group_bcast (group,  proc,  xdr.in,  inp,  xdr.out,  outp) 

int  group; 

u_long  proc; 

xdrproc.t  xdr.in; 

addr_t  inp ; 

xdrproc.t  xdr_out; 

addr_t  outp; 

♦include  <dcs.h> 

DESCRIPTION 

ctl_group_bcast()  makes  a  broadcast  RPC  to  DCS  servers  con- 
nected by  links  of  group,  proc  is  the  procedure  to  be 
invoked  in  the  remote  servers.  Input  to  the  RPC  is  passed  as 
inp  and  the  output  obtained  through  outp.  xdr.in  and  xdr.out 
are  the  XDR(3N)  filters  for  input  and  output  respectively. 

E.g.,  procedure  proc  at  all  the  servers  of  sites  participat- 
ing in  a  conference  can  be  invoked  by  calling 
ctl.group.bcastO  using  group  for  the  conference  containing 
links  to  these  servers. 

Meaning  of  an  output  from  ctl_group_bcast()  depends  on  proc. 
E.g.,  the  first  output  received  from  a  server  may  be 
returned  through  outp. 

SEE  ALSO 

ctl_error_perror(3) ,  ctl_group.create(3) 

RETURN  VALUES 

ctl_group_bcast()  returns  OK  on  successful  broadcast  and  one 
of  the  following  error  values  on  failure. 

ECTLBADGROUP   Bad  group. 

ECTLGRPEMPTY   No  sites  in  group. 

ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  this  error  value. 
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BUGS 

At  the  moment,  outputs  from  RPCs  are  ignored  completely. 
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NAME 

ctl.site.link,  ctl_site_unlink  -  handle  links  to  DCS  sites 

SYNOPSIS 

int  ctl_site_link(site) 
char  *site; 

int  ctl_site_unlink(link) 
int  link; 

#include  <dcs.h> 

DESCRIPTION 

ctl_site_link()  establishes  a  communication  link  to  the  set 
of  DCS  servers  at  site.  Location  of  the  servers  for  a  site 
is  represented  in  the  DCSROOT/ etc/hosts  file.  After  a  suc- 
cessful linkup,  ctl_site_call(3)  can  be  used  to  make  an  RPC 
to  the  site. 

Links  can  also  be  added  to  groups  with  ctl_group_add(3)  and 
a  broadcast  RPC  made  to  the  connected  sites  with 
ctl_group_bcast(3) . 

ctl_site_unlink()  tears  down  the  connection  of  link. 

FILES 

DCSROOT/etc/hosts   DCS  file  to  resolve  host  and  site  names 

SEE  ALSO 

ctl_error.perror(3) ,  ctl_gethostbyname(3) ,  ctl_group.add(3) , 
ctl_group_bcast(3) ,  ctl_site_call(3) 

RETURN  VALUES 

ctl.site_link()  returns  an  integer  representing  the  new  link 
or  one  of  the  following  error  values . 

ECTLBADSITE  Bad  site  name.  Site  information  is  usu- 
ally present  in  the  DCSROOT/etc/hosts 
file  and  this  error  means  there  is  no 
entry  for  site. 

ECTLNOSVC  No  server  found  at  site.  Host  often, 
this  happens  if  there  is  no  DCS  server 
running  at  any  of  the  hosts  allotted  to 
site. 
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ECTLNOLINKS    No  links  left. 
ctl_site_unlink()  returns  the  following. 

OK  Link  successfully  broken  down. 
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ECTLBADLINK    Link  is  invalid. 

ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 

BUGS 

At  the  moment,  only  one  of  the  servers  at  site  is  used  for 
linkup. 
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NAME 

ctl_site_call  -  make  an  RPC  to  a  DCS  site 

SYNOPSIS 

int  ctl_site_call(link,  proc,  xdr.in,  inp,  xdr.out,  outp) 

int  link; 

u_long  proc; 

xdrproc.t  xdr_in; 

addr_t  inp; 

xdrproc.t  xdr_out; 

addr.t  outp; 

♦include  <dcs.h> 

DESCRIPTION 

ctl_site_call()  makes  an  RPC  to  the  DCS  site  connected  by 
link,  proc  is  the  procedure  to  be  invoked  in  the  remote 
server.  Input  to  the  RPC  is  passed  as  inp  and  the  output 
obtained  through  outp.  xdr.in  and  xdr.out  are  the  XDR(3N) 
filters  for  input  and  output  respectively. 

link  should  be  the  value  returned  by  ctl.site_link(3)  on  a 
successful  connection  to  the  DCS  site. 

ctl.group.bcast(3)  uses  ctl_site_call()  to  make  its  broad- 
cast RPC. 

SEE  ALSO 

ctl_error_perror(3) ,  ctl_group.bcast(3) ,  ctl_site_link(3) 

RETURN  VALUES 

ctl.site.calK)  returns  OK  on  successful  RPC  and  one  of  the 
following  error  values  on  failure. 

ECTLBADLINK  Bad  link. 

ECTLRPCENCD  Can't  encode  input. 

ECTLRPCDECD  Can't  decode  output. 

ECTLRPCSEND  Failed  while  sending. 

ECTLRPCRECV  Failed  while  receiving. 

ECTLRPCTOUT  RPC  timed  out. 
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ECTLRPCINTR  RPC  interrupted. 

ECTLRPCVER  RPC  version  incompatible. 

ECTLRPCAUTH  RPC  authentication  error. 
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ECTLRPCPROG  Program  not  available. 

ECTLRPCPVER  Program  version  mismatch. 

ECTLRPCPROC  Procedure  unavailable. 

ECTLRPCDCDA  Can't  decode  arguments. 

ECTLERROR  Other  error. 

ERRORS 

ctl_error.perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  this  error  value. 


BUGS 


Currently,  a  link  represents  connection  with  only  one 
server,  so  all  servers  at  site  are  not  contacted. 
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NAME 

ctl_thread_init  -  initialize  DCS  CTL  thread  library 

SYNOPSIS 

int  ctl_thread_init() 

#include  <dcs.h> 

DESCRIPTION 

ctl_thread.init()  intializes  the  DCS  CTL  thread  library.  It 
is  an  error  to  call  any  other  thread  function  before  the 
library  is  initialized  with  ctl_thread_initQ  .  Normally, 
ctl_init(3)  will  call  ctl.thread.initO  . 

SEE  ALSO 

ctl.error_perror(3) ,  ctl.init(3) 

RETURN  VALUES 

ctl_thread_init()  always  returns  OK. 
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NAME 

ctl_thread_create  -  create  a  DCS  thread 

SYNOPSIS 

int  ctl_thread_create(tidp,  func,  args) 
ctl_thread_t  *tidp; 
void  *(*func)(); 
void  *args ; 

((include  <dcs.h> 

DESCRIPTION 

ctl_thread_create()  creates  a  DCS  thread  within  a  process. 
The  new  thread  begins  execution  as  if  the  function  func  has 
just  been  called  with  a  single  argument,  args.  If  tidp  is 
not  NULL,  ctl.thread. create O  writes  the  identity  of  the 
thread  in  the  space  pointed  by  tidp. 

DCS  threads  are  identical  in  most  respects  to  regular  heavy 
weight  processes  and  can  safely  make  blocking  system  calls 
such  as  read(2)  and  call  library  functions  such  as 
sleep(3V).  They  can  also  use  regular  stdio  library  functions 
such  as  printf(3V)  in  a  thread-safe  fashion. 

SEE  ALSO 

ctl_error_perror(3) ,  ctl_thread_exit(3) ,  ctl.thread_kill(3) , 
ctl_thread_self (3) 

RETURN  VALUES 

OK  On  successful  creation. 

ECTLAGAIN  Insufficient  resources. 

ECTLINVAL  Invalid  argument. 

ECTLERROR  Any  other  error. 

ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 
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NAME 

ctl_thread_self  -  identify  a  DCS  thread 

SYNOPSIS 

void  ctl_thread_self (tidp) 
ctl.thread.t  *tidp; 

#include  <dcs.h> 

DESCRIPTION 

ctl.thread.self  O  gives  the  identity  of  the  calling  DCS 
thread  through  tidp.  This  is  the  only  way  for  a  thread  to 
know  its  identity. 

SEE  ALSO 

ctl_error.perror(3) ,  ctl_thread_create(3) 
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NAME 

ctl.thread.exit  -  terminate  a  DCS  thread 

SYNOPSIS 

void  ctl_thread_exit() 

#include  <dcs.h> 

DESCRIPTION 

ctl_thread_exit()  terminates  the  calling  DCS  thread,  clean- 
ing up  as  necessary. 

If  a  thread  returns  from  its  starup  function  (see 
ctl.thread. create (3)) ,  an  implicit  call  is  made  to 
ctl.thread.exitO.  If  the  calling  thread  is  the  last  in  the 
process,  the  process  terminates  as  if  a  call  Has  made  to 
exit (3) . 

SEE  ALSO 

ctl_error.perror(3) ,  ctl.thread.create(3) , 

ctl_thread_kill(3) 
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NAME 

ctl_thread_signal  -  set  to  catch  signal  sent  to  DCS  thread 

SYNOPSIS 

int  ctl_thread_signal(sig,  func) 

int  sig; 

void  (*func)Q  ; 

#include  <signal.h> 
#include  <dcs.h> 

DESCRIPTION 

ctl.thread.signalQ  arranges  to  dispatch  signal  sig  sent  to 
a  DCS  thread,  with  function  func.  If  the  calling  thread 
receives  sig  after  a  successful  call  to  ctl_thread_signal() , 
the  thread  will  be  interrupted  and  func  called  with  sig  as 
the  single  argument. 

Signals  can  be  sent  to  threads  using  ctl_thread_kill(3) . 

SEE  ALSO 

ctl_error_perror(3) ,  ctl_thread_kill(3) 

NOTES 

Typically,  func  will  be  declared  as 

void  func(sig) 
int  sig; 

RETURN  VALUES 

OK  On  success. 

ECTLERROR  On  failure. 

ERRORS 

ctl_error.perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 

BUGS 

Error  values  returned  by  ctl.thread.signalQ  do  not  generate 
very  useful  diagnostic  messages. 
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NAME 

ctl_thread_kill  -  send  a  signal  to  a  DCS  thread 

SYNOPSIS 

int  ctl_thread_kill(tid,  sig) 
ctl_thread_t  tid; 
int  sig; 

#include  <signal.h> 
♦include  <dcs.h> 

DESCRIPTION 

ctl_thread_kill()  sends  signal  sig  to  the  DCS   thread 
represented  by  tidp. 

ctl_thread_signal(3)  can  be  used  by  individual  threads  to 
arrange  to  catch  signals. 

SEE  ALSO 

ctl_error_perror(3) ,  ctl.thread_create(3) , 

ctl_thread_exit(3) ,  ctl_thread_signal(3) 

RETURN  VALUES 

OK  On  success. 

ECTLERROR  On  failure. 

ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 

BUGS 

Error  values  returned  by  ctl.thread.signalO  do  not  generate 
very  useful  diagnostic  messages. 

DCS  Release  2. Obi  Last  change:  31  October  1995  ! 


109 


CTL_MUTEX_INIT(3) 


DCS  Functions 
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NAME 


ctl_mutex_init , 
ctl.mutex.unlock , 
DCS  threads 


ctl.mutex.lock,      ctl_mutex_trylock, 
ctl_mutex_destroy  -  mutex  operations  for 


SYNOPSIS 

int  ctl_mutex_init(midp) 
ctl_mutex_t  *midp ; 

int  ctl_mutex_lock(midp) 
ctl.mutex.t  *midp; 

int  ctl_mutex_trylock(midp) 
ctl_mutex_t  *midp; 

int  ctl.mutex.unlock  (midp) 
ctl.mutex.t  *midp; 

int  ctl.mutex. destroy (midp) 
ctl.mutex.t  *midp; 

#include  <dcs.h> 

DESCRIPTION 

ctl.mutex.initO,  ctl.mutex.lockO ,  ctl.mutex.trylockO, 
ctl_mutex_unlock() ,  and  ctl_mutex_destroy()  are  operations 
for  mutual  exclusion  among  DCS  threads.  ctl.mutex.initO 
prepares  the  mutex  variable  midp  for  the  various  operations 
and  should  be  called  first.  ctl.mutex.lockO  locks  midp  or 
blocks  if  it  is  already  locked.  ctl.mutex.trylockO  is 
similar  to  ctl.mutex.lockO,  except  that  it  does  not  block 
if  midp  is  found  locked.  ctl.mutex.unlockO  unlocks  midp 
and  consequently  unblocks  all  the  threads  blocked  on  midp. 
ctl_mutex_destroy()  eliminates  the  attributes  of  midp  and 
renders  it  unusable  -  ctl.mutex.initO  will  have  to  be 
called  again  before  midp  can  be  used  for  mutex  operations. 


If 


more 


than 


one  thread  is  unblocked  by  a 
ctl.mutex.unlockO,  the  scheduler  decides  which  thread  will 
run  next. 


SEE  ALSO 

ctl_error_perror(3) ,  ctl. thread. create (3) 

RETURN  VALUES 
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ctl_mutex_init(),  ctl_mutex_lock() ,  ctl_mutex_trylock() , 
ctl_mutex_unlock(),  and  ctl_mutex_destroy()  all  return  OK  on 
success  and  one  of  the  following  errors  on  failure . 

ctl_mutex_init() 

ECTLAGAIN      Lack  of  resources  (other  than  memory)  to 
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build  the  mutex  variable . 

ECTLNOMEM      Insufficient  memory  to  create  a  mutex 
variable. 

ctl_mutex_lock() 

ECTLINVAL      Invalid  mutex  variable. 

ECTLDEADLK     A  deadlock  would  occur  if  mutex  is 
locked. 

ctl_mutex_trylock() 

ECTLBUSY      Mutex  is  already  locked. 

ECTLINVAL      Invalid  mutex  variable. 

ECTLDEADLK     A  deadlock  would  occur  if  mutex  is 
locked. 

ctl_mutex_unlock() 

ECTLINVAL      Invalid  mutex  variable. 

ECTLPERH      Calling  thread  does  not  own  the  mutex. 

ctl_mutex_destroy() 

ECTLBUSY      The  mutex  variable  is  locked  or  under 
use  by  another  thread. 

ECTLINVAL      Invalid  mutex  variable. 

All  these  functions  return  ECTLERROR  for   non-specific 


Ill 


ERRORS 

ctl_error_perror(3)  can  be  used  to  print  a  diagnostic  mes- 
sage using  the  error  value  returned. 
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NAME 

ctl_error_perror  -  print  DCS  error  message 

SYNOPSIS 

void  ctl_error_perror(msg,  errno) 
char  *msg; 
int  errno ; 

#include  <dcs.h> 

DESCRIPTION 

ctl_error_perror()  prints  a  diagnostic  error  message  on  the 
standard  error  corresponding  to  errno.  If  msg  is  not  a  NULL 
pointer,  msg  is  printed  first,  followed  by  a  colon,  a  space, 
and  then  the  error  message. 

BUGS 

If  there  is  no  error  string  corresponding  to  errno, 
ctl_error_perror()  may  cause  a  memory  fault. 
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